#  Statistics for Machine Learning

# Data Types

<img src ="images/datatypes.png">

## Categorical Data

-------------------------------------------------------------------------------------------------------------------------------

Categorical data represents characteristics, such as a person’s gender or language. Categorical data can also be encoded with numerical values (for example, 1 for female and 0 for male), but those numbers have no mathematical meaning.

### Nominal Data
Nominal values represent discrete units and are used to label variables that have no quantitative value. Just think of them as "labels". Nominal data has no order, so changing the order of its values does not change the meaning.

### Ordinal Data
Ordinal values represent discrete and ordered units. Ordinal data is therefore nearly the same as nominal data, except that its ordering matters.

## Numerical Data

-------------------------------------------------------------------------------------------------------------------------------

### Discrete Data

We speak of discrete data if its values are distinct and separate, in other words if the data can only take on certain values. This type of data can’t be measured, but it can be counted. It represents information that can be sorted into distinct classes. An example is the number of heads in 100 coin flips.

### Continuous Data
Continuous data represents measurements: its values can’t be counted, but they can be measured. An example is a person’s height, which can take any value within a range.

#  Measures of Central Tendency

## Mean

The mean (represented by the Greek letter mu, μ) is the average of a dataset: the sum of all values divided by the number of values.

In [8]:
import numpy as np
numbers=[-4,-3,-2,-1,0,1,2,3,4]

mean=np.mean(numbers)
print("The mean of the Number is :{}".format(mean))

The mean of the Number is :0.0


## Median

The median is the middle of a dataset. To calculate the median, we sort all the values (in ascending or descending order) and take the one that is in the middle.

If there is an even number of data points, then we calculate the mean of the two that fall in the middle.
The median is less susceptible to outliers than the mean, so we need to consider what the data distribution looks like when choosing which one to use.

In [7]:
import numpy as np
numbers=[-4,-3,-2,-1,0,1,2,3,4]

median=np.median(numbers)
print("The median of the Number is :{}".format(median))

The median of the Number is :0.0

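To illustrate why the median is less susceptible to outliers, here is a minimal sketch that adds one extreme value to the dataset used above. The outlier value of 1000 is an arbitrary assumption chosen for illustration.

```python
import numpy as np

# Same values as above, plus one hypothetical extreme outlier.
numbers = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
with_outlier = numbers + [1000]

# The mean is pulled strongly towards the outlier...
print("mean:  ", np.mean(numbers), "->", np.mean(with_outlier))
# ...while the median barely moves.
print("median:", np.median(numbers), "->", np.median(with_outlier))
```

The mean jumps from 0.0 to 100.0, whereas the median only moves from 0.0 to 0.5.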

## Mode

The mode is the most common value in the dataset. To calculate the mode, we locate the value that occurs most frequently.

In [12]:
from scipy import stats

numbers=[-4,-3,-2,-1,0,1,2,3,-4]

# mode[0] holds the most frequent value (its exact shape depends on the SciPy version)
mode=stats.mode(numbers)
print("The mode of the Number is :{}".format(mode[0]))

The mode of the Number is :[-4]

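As an alternative sketch that does not depend on SciPy, the mode can also be found by counting occurrences with the Python standard library. The second list is a made-up example showing a tie between two values.

```python
from collections import Counter
from statistics import multimode

numbers = [-4, -3, -2, -1, 0, 1, 2, 3, -4]

# Count how often each value occurs and take the most frequent one.
counts = Counter(numbers)
value, frequency = counts.most_common(1)[0]
print("mode:", value, "occurs", frequency, "times")

# multimode returns every value tied for the highest count.
tied = multimode([1, 1, 2, 2, 3])
print("modes of a tied dataset:", tied)
```

A dataset can have more than one mode, which is why `multimode` returns a list.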

# Measures of Variability

## Range

Range is the difference between the lowest and the highest number of a dataset. To calculate the range, we subtract the minimum from the maximum value.
It shows us how varied the dataset is, i.e. how spread out it is, but like the mean, it is very sensitive to outliers.

In [15]:
import numpy as np
numbers=[-4,-3,-2,-1,0,1,2,3,4]

range_data=np.max(numbers)-np.min(numbers)
print("The range of the Number is :{}".format(range_data))

The range of the Number is :8


## Variance

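Variance measures how spread out the data is. To calculate the variance, we take the average of the squared differences from the mean. Dividing by the number of values n gives the population variance; for a sample drawn from a larger population, we divide by n - 1 instead.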

In [21]:
import numpy as np
numbers=np.array([-4,-3,-2,-1,0,1,2,3,4])

# Population variance: the mean of the squared deviations from the mean
mean=np.mean(numbers)
squared_deviations=np.square(numbers-mean)
variance=np.sum(squared_deviations)/len(numbers)
print("The Variance of the Number is :{}".format(variance))

The Variance of the Number is :6.666666666666667


## Standard Deviation

Standard deviation (represented by the Greek letter sigma, σ) is just the square root of the variance.
It measures dispersion in the same units as the data. The distance of a data point from the mean is often expressed as a number of standard deviations, and as we will see in a following article, this is used to judge which data points are outliers.

In [22]:
import numpy as np
numbers=np.array([-4,-3,-2,-1,0,1,2,3,4])

# Population variance first, then its square root
mean=np.mean(numbers)
squared_deviations=np.square(numbers-mean)
variance=np.sum(squared_deviations)/len(numbers)
std_dev=np.sqrt(variance)
print("The Standard Deviation of the Number is :{}".format(std_dev))

The Standard Deviation of the Number is :2.581988897471611

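The manual calculations above can be cross-checked against NumPy's built-in functions. This is a short sketch: `np.var` and `np.std` divide by n by default (`ddof=0`), and passing `ddof=1` divides by n - 1, which is the usual choice when the data is a sample rather than the full population.

```python
import numpy as np

numbers = np.array([-4, -3, -2, -1, 0, 1, 2, 3, 4])

# Population versions (divide by n), matching the manual calculation above.
population_variance = np.var(numbers)
population_std = np.std(numbers)

# Sample versions (divide by n - 1).
sample_variance = np.var(numbers, ddof=1)
sample_std = np.std(numbers, ddof=1)

print("population:", population_variance, population_std)
print("sample:    ", sample_variance, sample_std)
```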

# Data Distribution

## Probability Space

In probability theory, a probability space or a probability triple (Ω, F, P) is a mathematical construct that models a real-world process (or “experiment”) consisting of states that occur randomly. A probability space is constructed with a specific kind of situation or experiment in mind. One proposes that each time a situation of that kind arises, the set of possible outcomes is the same and the probabilities are also the same.

A probability space consists of three parts:

- A sample space, Ω, which is the set of all possible outcomes.
- A set of events, F, where each event is a set containing zero or more outcomes.
- The assignment of probabilities to the events; that is, a function P from events to probabilities.

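As a concrete illustration, the three parts can be sketched for a single roll of a fair six-sided die. The die, the names `omega`, `events` and `P`, and the choice of taking every subset as an event are all assumptions made for this example.

```python
from fractions import Fraction
from itertools import combinations

# Sample space: all possible outcomes of one roll.
omega = frozenset(range(1, 7))

# Set of events F: here, every subset of omega (2**6 = 64 events).
events = [frozenset(c)
          for r in range(len(omega) + 1)
          for c in combinations(omega, r)]

def P(event):
    """Probability of an event, assuming all outcomes are equally likely."""
    return Fraction(len(event & omega), len(omega))

even = frozenset({2, 4, 6})
print("number of events:", len(events))
print("P(even) =", P(even))
print("P(omega) =", P(omega), " P(empty set) =", P(frozenset()))
```

The whole sample space has probability 1 and the empty event has probability 0.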
## Probability Distribution Functions

A probability distribution is a function that describes the likelihood of each possible event or outcome. Distributions come in many shapes, but in only one size: the probabilities in a distribution always add up to 1. We will now look at the different types of distributions, depending on whether the data is continuous or discrete.


## Probability Density Function (PDF)

When we see a graph like the one in the figure below, we might think that it shows the probability of a given value occurring. However, this is not entirely true for continuous data, because there is an infinite number of possible values. As such, the probability of any one specific value is infinitely small: effectively zero.
The PDF instead describes how probability is spread over ranges of values, hence the word ‘density’. To visualise the probability, we plot the distribution as a curve. The area under the curve between two points corresponds to the probability that the variable falls between those two values.


<img src="images/normaldist.png">

For the normal distribution shown above, there is a 34.1% probability of a value landing between the mean and one standard deviation (1σ) above it. So a given value has a 68.2% chance of falling between -1σ and 1σ, which is very likely.
This means that values are concentrated near the mean, and as we move beyond one standard deviation in either direction, the probability gets smaller and smaller.

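The area-under-the-curve idea can be checked numerically. The sketch below assumes a standard normal distribution and uses SciPy's cumulative distribution function (`cdf`), which gives the area under the PDF to the left of a point.

```python
from scipy import stats

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = stats.norm(loc=0, scale=1)

# Area under the PDF between two points = difference of the CDF values.
p_mean_to_1_sigma = standard_normal.cdf(1) - standard_normal.cdf(0)
p_within_1_sigma = standard_normal.cdf(1) - standard_normal.cdf(-1)

print("P(0 < X < 1 sigma)        =", round(p_mean_to_1_sigma, 4))
print("P(-1 sigma < X < 1 sigma) =", round(p_within_1_sigma, 4))
```

The results are approximately 0.3413 and 0.6827, in line with the percentages quoted above.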
## Probability Mass Function (PMF)
When it comes to discrete data, the PMF is the measure that gives us the probability of a given value occurring. To visualise the probability, we plot the dataset as a histogram.

<img src ="images/pmf.png">

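As a small sketch of a PMF, consider the number of heads in 10 flips of a fair coin, which follows a binomial distribution. The number of flips and the fair-coin probability of 0.5 are assumptions for this example.

```python
import numpy as np
from scipy import stats

n_flips = 10
heads = np.arange(n_flips + 1)          # possible outcomes: 0, 1, ..., 10 heads

# Probability of each exact number of heads.
pmf = stats.binom.pmf(heads, n_flips, 0.5)

for k, p in zip(heads, pmf):
    print(k, round(p, 4))

print("sum of all probabilities:", pmf.sum())
```

Unlike a PDF, each value of a PMF is itself a probability, and the values add up to 1.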
# Continuous Data Distributions

## PDF-1: Uniform / Rectangular Distribution

A uniform distribution means there is a flat constant probability of a value occurring within a given range, and is concerned with events that are equally likely to occur.

<img src ="images/ud.png">

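A minimal sketch of a uniform distribution, assuming a variable that is equally likely to fall anywhere between 0 and 10. In SciPy, `stats.uniform(loc, scale)` describes the interval from `loc` to `loc + scale`.

```python
from scipy import stats

# Uniform distribution on the interval [0, 10].
uniform = stats.uniform(loc=0, scale=10)

# The density is flat: the same value everywhere inside the interval.
density_at_3 = uniform.pdf(3)
density_at_8 = uniform.pdf(8)

# Probability of landing between 2 and 5 is the width of that range
# divided by the width of the whole interval.
p_between_2_and_5 = uniform.cdf(5) - uniform.cdf(2)

print(density_at_3, density_at_8, p_between_2_and_5)
```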
## PDF-2: Normal / Gaussian Distribution

We saw a standard normal distribution when we explored what a PDF is. The mean of the standard normal distribution is zero, and its standard deviation is one.
If we introduce the ‘random’ element by drawing a finite sample, the data does not follow a perfect curve, but looks more like the following example.

<img src = "images/nd.png">

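The effect of the ‘random’ element can be reproduced with a short sketch that draws a finite sample from a standard normal distribution. The sample size, seed and number of bins are arbitrary assumptions.

```python
import numpy as np

# Draw 1000 random values from a standard normal distribution.
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=0.0, scale=1.0, size=1000)

# Bin the values; plotting these counts gives a bumpy, imperfect bell shape.
counts, bin_edges = np.histogram(sample, bins=20)

# The sample mean and standard deviation are close to, but not exactly, 0 and 1.
print("sample mean:", sample.mean())
print("sample std: ", sample.std())
print("bin counts: ", counts)
```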
## PDF-3: T Distribution
T or Student’s T distribution looks a lot like the bell shaped curve of a normal distribution, but is a bit shorter and with heavier tails. It is used (instead of the normal distribution) when we have small samples and/or when the population variance is unknown.

<img src="images/td.png">

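The heavier tails can be made concrete by comparing tail probabilities. This sketch uses the survival function `sf` (the probability of exceeding a value); the threshold of 2 and the degrees of freedom 5 and 30 are arbitrary choices for illustration.

```python
from scipy import stats

threshold = 2.0

# Probability of a value larger than the threshold under each distribution.
tail_normal = stats.norm.sf(threshold)
tail_t5 = stats.t.sf(threshold, df=5)     # small sample: few degrees of freedom
tail_t30 = stats.t.sf(threshold, df=30)   # larger sample: closer to normal

print("normal:   ", round(tail_normal, 4))
print("t (df=5): ", round(tail_t5, 4))
print("t (df=30):", round(tail_t30, 4))
```

The fewer the degrees of freedom, the more probability sits in the tails; as the degrees of freedom grow, the T distribution approaches the normal distribution.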
## PDF-4: Chi-Square Distribution


__The chi-square (χ²) distribution is used to assess a series of problems:__

- whether a data set fits a particular distribution
- whether the distributions of two populations are the same
- whether two events might be independent
- whether the variability within a population differs from what is expected.

The curve of the chi-square distribution is skewed to the right.

<img src ="images/chi.png">

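As a sketch of the first use case (whether a data set fits a particular distribution), the example below runs a chi-square goodness-of-fit test with SciPy. The observed counts are made-up numbers for a die rolled 88 times.

```python
from scipy import stats

# Hypothetical counts for faces 1 to 6 of a die.
observed = [16, 18, 16, 14, 12, 12]

# With no expected frequencies given, chisquare assumes
# all categories are equally likely (a fair die).
statistic, p_value = stats.chisquare(observed)

print("chi-square statistic:", statistic)
print("p-value:", p_value)
```

A large p-value like the one here means the counts are consistent with a fair die; a small one would suggest the data does not fit the assumed distribution.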
# Measures of Location

## Percentiles
Percentiles divide ordered data into hundredths. In a sorted dataset, the p-th percentile is the value below which p percent of the data falls.


## Quartiles
Quartiles are special percentiles, which divide the data into quarters. The first quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile. The median is called both the second quartile, Q2, and the 50th percentile.


## Interquartile Range (IQR)
The IQR is a number that indicates how spread out the middle half (i.e. the middle 50%) of the dataset is, and it can help determine outliers. It is the difference between Q3 and Q1.

__Generally speaking, outliers are those data points that fall outside the range from Q1 – 1.5 x IQR to Q3 + 1.5 x IQR.__

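The quartiles, the IQR and the outlier rule can be sketched with NumPy as follows. The dataset is made up, and `np.percentile` uses linear interpolation between data points by default, so other tools may report slightly different quartiles.

```python
import numpy as np

# Hypothetical dataset with one suspicious value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quartiles are the 25th, 50th and 75th percentiles.
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Points outside these fences are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```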
# Moments
Moments describe various aspects of the nature and shape of our distribution.

1. The first moment is the mean of the data, which describes the location of the distribution.
2. The second moment is the variance, which describes the spread of the distribution. A higher variance means the data is more spread out.
3. The third moment is the skewness, which is basically a measure of how lopsided a distribution is. A positive skew means we have a left lean and a long right tail. This means that the mean is to the right of the bulk of our data. And vice versa:

<img src ="images/skewness.png">

4. The fourth moment is the kurtosis, which describes how thick the tails are and how sharp the peak is. It indicates how likely it is to find extreme values in our data: higher values make outliers more likely. This sounds a lot like spread (variance) but is subtly different.

<img src ="images/kurt.png">

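A short sketch of the third and fourth moments using SciPy. Both datasets are made up; note that `stats.kurtosis` reports excess kurtosis by default, so a normal distribution scores 0.

```python
import numpy as np
from scipy import stats

symmetric = np.array([-4, -3, -2, -1, 0, 1, 2, 3, 4])
right_skewed = np.array([1, 1, 2, 2, 3, 10])   # long right tail

# Skewness: 0 for symmetric data, positive for a long right tail.
skew_symmetric = stats.skew(symmetric)
skew_right = stats.skew(right_skewed)

# Excess kurtosis: negative for flat data without heavy tails.
kurtosis_symmetric = stats.kurtosis(symmetric)

print("skewness (symmetric):   ", skew_symmetric)
print("skewness (right-skewed):", skew_right)
print("excess kurtosis (symmetric):", kurtosis_symmetric)
```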
# Introduction to Covariance and Correlation

To lay the basis for this article, we will assume we have a scatterplot and each data point represents a person: their professional experience in years on one axis versus their income on another.

<img src="images/corelation.png">

## Covariance
Covariance is a measure of association between two (or more) random variables.
As the name ‘co + variance’ implies, it is like the variance, but applied to a comparison of two variables: in place of the sum of squares, we have a sum of cross-products.
While Variance tells us how a single variable varies from its mean, Covariance tells us how two variables vary together. As such it’s fair to say:
Covariance measures the joint variability of two variables.
Covariance can be positive, negative or zero. A positive value means that the two variables tend to vary in the same direction (i.e. if one increases, then the other one tends to increase too), a negative value means that they vary in opposite directions (i.e. if one increases, then the other one tends to decrease), and zero means that they don’t vary together linearly.

### Formula
The formula might be hard to interpret, but it is more important to understand what it means:

<img src ="images/cova.png">

If we think of the dataset of a random variable as a vector, then in the previous example we have two vectors, one for experience and one for income. Here are the steps we need to follow:

1. Convert these two vectors to vectors of deviations from the mean.
2. Take the dot product of the two vectors (which is equal to the product of their lengths and the cosine of the angle between them).
3. Divide by the sample size (n if the data is the full population, n - 1 if it is a sample).

In the 2nd step, we effectively measure how closely the two deviation vectors are aligned: if they point in nearly the same direction, the variables are tightly coupled.
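The steps above can be sketched in NumPy as follows. The experience and income figures are made-up values for illustration, and the result is cross-checked against `np.cov`, which divides by n - 1 by default.

```python
import numpy as np

# Hypothetical data: years of experience and income (in thousands).
experience = np.array([1, 3, 5, 7, 9], dtype=float)
income = np.array([30, 35, 50, 60, 75], dtype=float)

# Step 1: deviations from the mean.
experience_dev = experience - experience.mean()
income_dev = income - income.mean()

# Step 2: dot product, i.e. the sum of cross-products.
cross_products = np.dot(experience_dev, income_dev)

# Step 3: divide by n (population) or n - 1 (sample).
cov_population = cross_products / len(experience)
cov_sample = cross_products / (len(experience) - 1)

print("population covariance:", cov_population)
print("sample covariance:    ", cov_sample)
print("np.cov (sample):      ", np.cov(experience, income)[0, 1])
```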
### Main Limitation
It is important to note that while the Covariance does measure the directional relationship between two variables, it does not show the strength of the relationship between them.
In practice, the biggest problem with this metric is that it depends on the units used. For example, if we were to convert the years of experience into months of experience, then the Covariance would be 12 times larger!
This is where Correlation comes in!

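To make the unit dependence concrete, this sketch computes the Covariance once with experience in years and once in months. The data values are made up for illustration.

```python
import numpy as np

# Hypothetical data: years of experience and income (in thousands).
experience_years = np.array([1, 3, 5, 7, 9])
income = np.array([30, 35, 50, 60, 75])

# Same information, different unit.
experience_months = experience_years * 12

cov_years = np.cov(experience_years, income)[0, 1]
cov_months = np.cov(experience_months, income)[0, 1]

print("covariance (years): ", cov_years)
print("covariance (months):", cov_months)
print("ratio:", cov_months / cov_years)
```

The relationship between the variables has not changed, yet the Covariance is 12 times larger.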
## Correlation

Correlation is one of the most common metrics in statistics; it describes the degree of relationship between two random variables. It is considered to be the normalised version of the Covariance. Let’s see why…

### Formula
The Correlation (represented by the Greek letter ρ — rho) can be expressed using this formula:

<img src="images/corel.png">

- The correlation is bounded between -1 and 1. Like the Covariance, the sign of the Correlation indicates the direction of the relationship: positive means that random variables move together, negative means that random variables move in different directions.
- The endpoints (i.e. 1 and -1) indicate that there is a perfect relationship between the two variables. For instance, the relationship between meters and centimetres is always that 1m corresponds to 100cm. If we plot this relationship it will be a perfect line, and therefore the Correlation is 1.
- Please note that a perfect relationship is pretty rare in real-life data, since two random variables don’t usually map to each other by a constant factor.
- A Correlation of 0 means that there is no linear relationship between the two variables. There might still be a non-linear relationship, such as y = x².

### Key Characteristics
The Correlation indicates not only the direction of the relationship but also its strength (depending on how big the absolute value is), because it is unitless: since we divide the Covariance by the product of the two standard deviations, the units cancel out.
Finally, we need to remember that ‘Correlation does not imply Causation’: a high correlation between two random variables just means that they are associated with each other, but their relationship is not necessarily causal in nature. The most reliable way to establish causation is with controlled experiments, where we eliminate outside variables and isolate the effects of the two variables in question.

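The points above can be checked with a short sketch. It assumes the usual definition of the Correlation as the Covariance divided by the product of the two standard deviations, and uses made-up experience and income values.

```python
import numpy as np

# Hypothetical data: years of experience and income (in thousands).
experience = np.array([1, 3, 5, 7, 9], dtype=float)
income = np.array([30, 35, 50, 60, 75], dtype=float)

# Correlation = covariance / (std of x * std of y), using sample versions throughout.
cov = np.cov(experience, income)[0, 1]
rho_manual = cov / (np.std(experience, ddof=1) * np.std(income, ddof=1))
rho_numpy = np.corrcoef(experience, income)[0, 1]

# Changing the unit (years to months) leaves the correlation unchanged.
rho_months = np.corrcoef(experience * 12, income)[0, 1]

# A perfect but non-linear relationship can still have zero correlation.
x = np.array([-2, -1, 0, 1, 2], dtype=float)
rho_quadratic = np.corrcoef(x, x ** 2)[0, 1]

print(rho_manual, rho_numpy, rho_months, rho_quadratic)
```

Here y is fully determined by x in the last case, yet the Correlation is 0 because the relationship is not linear.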
# Conditional Probability

Conditional probability is the likelihood of an event occurring, given that another event has occurred.
The notation for conditional probability is P(A|B), read as ‘the probability of A given B’. The formula for conditional probability is:

<img src ="images/condp.png">

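As a worked sketch, assume the standard definition P(A|B) = P(A and B) / P(B) and a single roll of a fair die. The events A (the roll is even) and B (the roll is greater than 3) are chosen purely for illustration.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}          # outcomes of one fair die roll

def P(event):
    """Probability of an event, assuming all outcomes are equally likely."""
    return Fraction(len(event), len(omega))

A = {2, 4, 6}                        # the roll is even
B = {4, 5, 6}                        # the roll is greater than 3

# P(A|B) = P(A and B) / P(B)
p_a_given_b = P(A & B) / P(B)

print("P(A)   =", P(A))
print("P(A|B) =", p_a_given_b)
```

Knowing that the roll is greater than 3 raises the probability of an even roll from 1/2 to 2/3.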
# Bayes’ Theorem

Having just explored what Conditional Probability is, let’s take a look at Bayes’ Theorem. It simply says:
The probability of A given B is equal to the probability of B given A times the probability of A, over the probability of B.

<img src ="images/bayes.png">

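A minimal sketch of Bayes’ Theorem applied to a medical test. All numbers (1% prevalence, 95% detection rate, 5% false positive rate) and the helper name `bayes_posterior` are assumptions made for this example.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers for illustration.
p_disease = 0.01                 # P(A): prior probability of having the disease
p_pos_given_disease = 0.95       # P(B|A): test is positive if the disease is present
p_pos_given_healthy = 0.05       # false positive rate

# P(B): overall probability of a positive test (law of total probability).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = bayes_posterior(p_pos_given_disease, p_disease, p_pos)
print("P(disease | positive test) =", round(p_disease_given_pos, 3))
```

Even with a fairly accurate test, the probability of disease after a positive result is only about 16% under these assumptions, because the disease is rare to begin with.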
# Hypothesis Testing

## Introduction

Hypothesis testing is a critical tool in inferential statistics for determining what the value of a population parameter could be. We usually draw this conclusion from an analysis of sample data.

With the advent of data-driven decision making in business, science, technology, social, and political undertakings, the concept of hypothesis testing has become critically important to understand and apply in the right context.



Hypothesis testing was introduced by Ronald Fisher, Jerzy Neyman, Karl Pearson and Pearson’s son, Egon Pearson. It is a statistical method used to make statistical decisions from experimental data. A hypothesis is basically an assumption that we make about a population parameter.

__Key terms and concepts:__

- __Null hypothesis__: A statistical hypothesis that assumes the observation is due to chance. It is denoted by H0; for example, H0: μ1 = μ2 states that there is no difference between the two population means.
- __Alternative hypothesis__: Contrary to the null hypothesis, the alternative hypothesis states that the observations are the result of a real effect.
- __Level of significance__: The threshold at which we reject the null hypothesis. 100% certainty is not possible when accepting or rejecting a hypothesis, so we select a level of significance, usually 5%.
- __Type I error__: Rejecting the null hypothesis although it is true. The probability of a Type I error is denoted by alpha. In hypothesis testing, the area of the normal curve that covers the critical region is called the alpha region.
- __Type II error__: Accepting the null hypothesis although it is false. The probability of a Type II error is denoted by beta. In hypothesis testing, the area of the normal curve that covers the acceptance region is called the beta region.
- __Power__: The probability of correctly rejecting a false null hypothesis. 1-beta is called the power of the analysis.
- __One-tailed test__: When the alternative hypothesis specifies a direction, that is, a less-than or greater-than relationship such as H1: μ1 > μ2, it is called a one-tailed test.
- __Two-tailed test__: When the alternative hypothesis allows a difference in either direction, such as H1: μ1 ≠ μ2 against H0: μ1 = μ2, it is called a two-tailed test.

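To tie these terms together, here is a minimal sketch of a two-tailed test of H0: μ1 = μ2 using SciPy's independent two-sample t-test. The two groups of measurements are made up, and the 5% level of significance follows the convention mentioned above.

```python
from scipy import stats

# Hypothetical measurements from two groups.
group_a = [5.1, 4.9, 5.0, 5.2, 4.8]
group_b = [5.9, 6.1, 6.0, 6.2, 5.8]

alpha = 0.05   # level of significance

# ttest_ind tests H0: the two population means are equal (two-tailed by default).
t_statistic, p_value = stats.ttest_ind(group_a, group_b)

reject_null = p_value < alpha
print("t statistic:", t_statistic)
print("p-value:", p_value)
print("reject H0:", reject_null)
```

If the p-value is below alpha, we reject the null hypothesis; otherwise the data does not give us enough evidence to reject it.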
https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/hypothesis-testing/