

Crash Course in Statistics for Machine Learning

You do not need to know statistics before you can start learning and applying machine learning. You can start today.

Nevertheless, knowing some statistics can be very helpful to understand the language used in machine learning. Knowing some statistics will eventually be required when you want to start making strong claims about your results.

In this README, you will discover a few key concepts from statistics that will give you the confidence you need to get started and make progress in machine learning.

Statistical Inference

There are processes in the real world that we would like to understand.

For example, human behaviours such as clicking on an ad or buying a product.

These processes are not straightforward to understand. There are complexities and uncertainties, and the process has an element of randomness to it (it is stochastic).

We understand these processes by making observations and collecting data. The data is not the process; it is a proxy for the process that gives us something to work with in order to understand it.

The methods we use to make observations and collect or sample data also introduce uncertainties into the data. Together with the inherent randomness in the real-world process, we now have two sources of randomness in our data.

Given the data we have collected, we clean it up, create a model, and try to say something about the process in the real world.

For example, we may make a prediction or describe the relationships between elements within the process.

This is called statistical inference. We start from a real-world stochastic process, collect data and model the process, and come back to the world to say something about it.
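As a toy sketch of this loop (all numbers here are hypothetical), we can simulate a stochastic process, such as users clicking an ad with some unknown probability, observe a sample from it, and use a simple model of the sample to say something about the underlying process:

```python
from random import Random

# Hypothetical stochastic process: each user clicks an ad with some
# unknown probability. Here it is fixed so we can check the estimate.
TRUE_CLICK_RATE = 0.1

rng = Random(42)

# Observe the process: collect a sample of click/no-click outcomes.
sample = [1 if rng.random() < TRUE_CLICK_RATE else 0 for _ in range(10_000)]

# Model the process with a single parameter: the estimated click rate.
estimated_rate = sum(sample) / len(sample)

# Use the model to say something about the real-world process.
print(f"estimated click rate: {estimated_rate:.3f} (true rate: {TRUE_CLICK_RATE})")
```

The estimate will be close to, but not exactly, the true rate: both the randomness of the process and the act of sampling introduce uncertainty.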

Statistical Population

Data belongs to a population (N). A data population is the set of all possible observations that could be made. The population is abstract, an ideal.

When you make observations or work with data, you are working with a sample of the population (n).

If you are working on a prediction problem, you are seeking to best leverage the sample n to characterize the population N, so that you minimize the errors in the predictions you make on the other samples your system will encounter.

You must be careful in your selection and handling of your sample. The size and quality of the data will affect your ability to effectively characterize the problem, to make predictions, or to describe the data. The randomness (and biases) introduced during data collection must be considered, and even manipulated, managed, or corrected.
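A quick sketch of how sample size affects how well a sample (n) characterizes a population (N). The population here is simulated and the numbers are illustrative only; larger samples tend to estimate the population mean more reliably, though any single draw can vary:

```python
import random
import statistics

random.seed(1)

# Simulated population (N): 100,000 observations, e.g. heights in cm.
population = [random.gauss(170, 10) for _ in range(100_000)]
population_mean = statistics.mean(population)

# Draw samples (n) of increasing size and compare each sample mean
# to the population mean.
errors = {}
for n in (10, 100, 1_000, 10_000):
    sample = random.sample(population, n)
    errors[n] = abs(statistics.mean(sample) - population_mean)
    print(f"n={n:>6}: |sample mean - population mean| = {errors[n]:.3f}")
```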

Big Data

The promise of big data is that you no longer need to worry about sampling data, that you can work with all the data.

The implication is that you are working with N and not n. This is false and dangerous thinking.

You are still working with a sample, and you can see how this is the case. For example, if you are modeling customer data in a SaaS business, you are working with a sample of the population: those people who found and signed up for the service prior to your modeling. These caveats bias the data you are working with.

You must be careful not to overgeneralize your findings, and to be cautious about claims beyond the data you have observed. For example, the trends of all Twitter users do not represent the trends of all humans.

In the other direction, big data allows you to model each individual entity, such as a single customer (n=1), using all data collected on that entity to date. This is a powerful, exciting, and computationally demanding frontier.

Statistical Models

The world is complicated and we need to simplify it with assumptions in order to understand it.

A model is a simplification of a process in the real world. It will always be wrong, but it might be useful.

A statistical model describes the relationship between data attributes, such as that between a dependent variable and one or more independent variables.

You can think about your data beforehand and propose a model that describes relationships within the data.

You can also run machine learning algorithms that assume a model of a specific form will describe the relationship, and that find the parameters to fit that model to the data. This is where the notions of fit, overfitting, and underfitting come from: the model may be too specific, or not specific enough, to generalize beyond the observed data.

Simpler models are easier to understand and use than more complex models. As such, it is a good idea to start with the simplest model for a problem and increase complexity as needed. For example, assume a linear form for your model before considering a non-linear one, or a parametric model before a non-parametric one.
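A minimal sketch of this progression using NumPy's polynomial fitting (the data is simulated and the degree choices are illustrative only): fit the simplest model, a straight line, first; a much more flexible model has the capacity to chase the noise in this particular sample, which is the essence of overfitting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: a roughly linear relationship plus observation noise.
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.shape)

# Start with the simplest model: a straight line (degree-1 polynomial).
linear = np.polyfit(x, y, deg=1)

# A more complex model (degree 9) has more capacity and can fit
# the noise in this particular sample, i.e. overfit.
flexible = np.polyfit(x, y, deg=9)

def training_mse(coeffs):
    """Mean squared error of a polynomial fit on the observed data."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

print(f"linear fit: slope={linear[0]:.2f}, intercept={linear[1]:.2f}")
print(f"training MSE: linear={training_mse(linear):.3f}, "
      f"degree-9={training_mse(flexible):.3f}")
```

Note that training error alone favors the complex model; judging generalization requires held-out data, which is what methods such as k-fold cross-validation (covered in the notebooks above) are for.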


In this README, you took a brief crash course in key concepts in statistics that you need when getting started in machine learning.

Specifically, you covered the ideas of statistical inference, statistical populations, how big data fits in, and statistical models.

Take it slow; statistics is a big field, and you do not need to know it all.

Don’t rush out and purchase an undergraduate textbook on statistics, at least, not yet. It is too much, too soon.
