# Statistics in Python

## Central Tendency
The *mean*, *median*, and *mode* are the three most common statistical measures of the __central tendency__ of data. Each identifies a central value that summarizes the entire data set. Three additional measures of central tendency are the *trimean*, the *geometric mean*, and the *trimmed mean*.

### Mean $(\bar x)$
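The **mean** or **average** is the sum of all data points divided by the number of data points. It is the single number that best represents the data in the sense that it minimizes the sum of squared deviations from it.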

$$\bar x=\frac{x_1+x_2+\cdots+x_n}{n}$$
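
As a minimal sketch, the formula can be computed by hand and compared with the standard library's `statistics.mean`. The values in `data` are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical sample data, chosen only for illustration.
data = [2, 4, 4, 5, 10]

# Direct translation of the formula: sum of the values divided by their count.
manual_mean = sum(data) / len(data)

# The standard library computes the same quantity.
library_mean = mean(data)

print(manual_mean, library_mean)
```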

### Median
The __median__ is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as the "middle" value: the middle element of the sorted data when the number of values is odd, and the average of the two middle elements when it is even. The main advantage of the median over the mean is that it is far less affected by extremely large or small values.

<div class="alert alert-info">The mean and the median are the same for symmetric distributions.</div>
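
The sketch below illustrates both points with `statistics.median`: the mean and median agree on a symmetric sample, and a single extreme value moves the mean far more than the median. The salary figures are hypothetical.

```python
from statistics import mean, median

# Hypothetical salaries (in thousands); the sample is symmetric around 35.
salaries = [30, 32, 35, 38, 40]

# The same sample with one extreme value appended.
salaries_with_outlier = salaries + [400]

# Symmetric sample: mean and median coincide.
print(mean(salaries), median(salaries))

# With the outlier the mean jumps, while the median barely moves.
print(mean(salaries_with_outlier), median(salaries_with_outlier))
```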

### Mode
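The __mode__ of a set of data values is the value that appears most often. For a discrete random variable, it is the value $x$ at which the probability mass function takes its maximum value. In other words, it is the value that is most likely to be sampled. A data set can have more than one mode when several values share the highest frequency.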

<div class="alert alert-info">The mean of a dataset always changes when any single value in the dataset changes. The median and mode may or may not change when a single value is altered.</div>
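
A short sketch of the mode, using `statistics.mode` and `statistics.multimode` from the standard library. It also changes a single value to show the mean moving while the median and mode stay put. All data values here are hypothetical.

```python
from statistics import mean, median, mode, multimode

# Hypothetical scores; 2 appears most often.
scores = [1, 2, 2, 3, 7]

# Same scores with only the largest value altered.
changed = [1, 2, 2, 3, 12]

print(mode(scores))                    # most frequent value
print(multimode([1, 1, 2, 2, 3]))      # all values tied for most frequent

# Altering one value changes the mean, but here not the median or mode.
print(mean(scores), mean(changed))
print(median(scores), median(changed))
print(mode(scores), mode(changed))
```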

### Trimean
The __trimean__ is a weighted average of the 25th percentile, the 50th percentile, and the 75th percentile. Letting $P_{25}$ be the 25th percentile, $P_{50}$ be the 50th and $P_{75}$ be the 75th percentile, the formula for the trimean is:
$$\text{Trimean}=\frac{P_{25}+2P_{50}+P_{75}}{4}$$

The median $P_{50}$ is weighted twice as much as the 25th and 75th percentiles.
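
A possible implementation of the trimean is sketched below. The helper name `trimean` is hypothetical. Percentile conventions vary between tools, so the result depends on the chosen method; this sketch assumes the `"inclusive"` method of `statistics.quantiles`, which interpolates linearly between data points.

```python
from statistics import quantiles

def trimean(values):
    """Weighted average of the quartiles: (P25 + 2*P50 + P75) / 4."""
    # n=4 returns the three cut points P25, P50, P75.
    # Assumption: the "inclusive" method; other percentile definitions
    # can give slightly different quartiles.
    p25, p50, p75 = quantiles(values, n=4, method="inclusive")
    return (p25 + 2 * p50 + p75) / 4

# Hypothetical data with one large value at the top.
data = [1, 2, 3, 4, 5, 6, 7, 8, 19]

print(trimean(data))
```

Because only the quartiles enter the formula, the large value 19 does not influence the result in this example.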

### Geometric mean
The __geometric mean__ is computed by multiplying all the numbers together and then taking the $N$th root of the product, where $N$ is the number of values. It is defined only for positive numbers. For example, for the numbers 1, 10, and 100, the product of all the numbers is: 1 × 10 × 100 = 1,000. Since there are three numbers, we take the cube root of the product (1,000), which is equal to 10. The formula for the geometric mean is:
$$\text{Geometric Mean}=\left(\prod x\right)^{\frac{1}{N}}$$
where the symbol $\prod$ means to multiply all the values $x$ together.
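
The worked example above (1, 10, 100) can be reproduced as a small sketch, both directly from the formula and with `statistics.geometric_mean`. Floating-point roots are rarely exact, so the results are compared with `math.isclose` rather than `==`.

```python
import math
from statistics import geometric_mean

numbers = [1, 10, 100]

# Direct translation of the formula: product, then the N-th root.
product = math.prod(numbers)
manual = product ** (1 / len(numbers))

# The standard library version (computed via logarithms).
library = geometric_mean(numbers)

# Both are approximately 10, up to floating-point rounding.
print(manual, library)
print(math.isclose(manual, 10.0), math.isclose(library, 10.0))
```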

### Trimmed mean
A __trimmed mean__ is obtained by removing some of the highest and lowest scores and computing the mean of the remaining scores. A mean trimmed 10% is a mean computed with 10% of the scores trimmed off: 5% from the bottom and 5% from the top. A mean trimmed 50% is computed by trimming the upper 25% of the scores and the lower 25% of the scores and computing the mean of the remaining scores. The trimmed mean is similar to the median, which, in essence, trims the upper 49+% and the lower 49+% of the scores. The trimmed mean is therefore a hybrid of the mean and the median.
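
A minimal sketch of this procedure follows. The helper `trimmed_mean` is hypothetical, and it takes the total percentage to trim, split equally between both ends as described above. It assumes that a fractional number of scores to drop is rounded down; libraries may parameterize or round differently.

```python
def trimmed_mean(values, percent):
    """Mean after trimming `percent` of the scores, half from each end."""
    if not 0 <= percent < 100:
        raise ValueError("percent must be in the range [0, 100)")
    ordered = sorted(values)
    # Number of scores dropped from EACH end (assumption: rounded down).
    k = int(len(ordered) * percent / 100 / 2)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical scores with one extreme value.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

print(trimmed_mean(scores, 0))    # ordinary mean
print(trimmed_mean(scores, 20))   # drops the lowest and the highest score
```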

## Variability
![line](https://github.com/jamwine/Statistics/blob/master/imgs/line.PNG?raw=true)
### Range
The **Range** describes the extent of variability by considering the distance between the largest and the smallest values. The larger the range, the more noticeable the variation in the data will be. However, its greatest disadvantage is that it uses only the two extreme values, so it ignores how the rest of the data are spread around the mean and is heavily swayed by outliers; that's where variance comes in.
$$\text{range}=x_{max}-x_{min}$$
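
As a quick sketch with hypothetical data, the range follows directly from `max` and `min`. Appending a single outlier shows how strongly it is affected by extreme values.

```python
# Hypothetical data, chosen only for illustration.
data = [4, 8, 15, 16, 23, 42]

data_range = max(data) - min(data)

# One extreme value dominates the range completely.
with_outlier = data + [500]
outlier_range = max(with_outlier) - min(with_outlier)

print(data_range, outlier_range)
```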

### Mean Deviation
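To compute the variance, we first calculate each data point's deviation from the mean (__Mean Deviation__) and its square (__Squared Mean Deviation__):

$$\text{Mean Deviation}=x_i-\bar x$$

$$\text{Squared Mean Deviation}=(x_i-\bar x)^2$$

The sum of the mean deviations over all data points is always 0, because the positive and negative deviations cancel. This is why the deviations are squared before they are averaged.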
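
The sketch below computes both quantities for a hypothetical data set whose mean is a whole number, so the zero sum comes out exactly. With general floating-point data the sum may differ from 0 by a tiny rounding error.

```python
from statistics import mean

# Hypothetical data with an integer mean (5), to avoid rounding noise.
data = [2, 4, 4, 5, 10]
x_bar = mean(data)

# Deviation of each point from the mean, and its square.
deviations = [x - x_bar for x in data]
squared_deviations = [d ** 2 for d in deviations]

print(deviations, sum(deviations))
print(squared_deviations, sum(squared_deviations))
```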

### Variance $(\sigma^2)$ 

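__Variance__ measures how far the data points are spread around the mean, which makes it the natural companion to the mean when summarizing a set of data points. It is the mean squared deviation of a variable from its mean. The higher the variance, the larger the variability of the data. For a sample of $n$ data points, it is computed as:

$$\text{Variance}=\frac{\sum (x_i-\bar x)^2}{n-1}$$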

|![wh](https://github.com/jamwine/Statistics/blob/master/imgs/head.png?raw=true)|![wh](https://github.com/jamwine/Statistics/blob/master/imgs/head.png?raw=true)|
|-|-|
|![var1](https://github.com/jamwine/Statistics/blob/master/imgs/var1.PNG?raw=true)|![var2](https://github.com/jamwine/Statistics/blob/master/imgs/var2.PNG?raw=true)|

Dividing by $n-1$ instead of $n$ is called __Bessel's Correction__. It corrects the bias that arises when the population variance is estimated from a sample using the sample mean $\bar x$.
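
To make the role of the denominator concrete, this sketch computes the variance of a hypothetical sample with both denominators and compares the results with `statistics.variance` (divides by $n-1$) and `statistics.pvariance` (divides by $n$).

```python
from statistics import mean, pvariance, variance

# Hypothetical sample, chosen only for illustration.
data = [2, 4, 4, 5, 10]
n = len(data)
x_bar = mean(data)

# Sum of squared deviations from the mean.
ss = sum((x - x_bar) ** 2 for x in data)

sample_variance = ss / (n - 1)    # with Bessel's correction
population_variance = ss / n      # without the correction

print(sample_variance, variance(data))
print(population_variance, pvariance(data))
```

The corrected value is always the larger of the two, and the gap shrinks as $n$ grows.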

Variance Properties:
* If a constant is added to each data point, the variance remains the same. Adding a constant shifts the distribution and doesn't affect the variability of the data. The deviations from the mean remain the same as well.
* If each observation is multiplied by a constant, the standard deviation is multiplied by the absolute value of that constant and the variance is multiplied by the square of the constant.
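
Both properties can be checked numerically. The sketch below uses hypothetical data, an arbitrary shift of 100, and an arbitrary scale factor of 3.

```python
import math
from statistics import stdev, variance

# Hypothetical data, chosen only for illustration.
data = [2, 4, 4, 5, 10]

shifted = [x + 100 for x in data]   # add a constant to every point
scaled = [3 * x for x in data]      # multiply every point by a constant

# Adding a constant leaves the variance unchanged.
print(variance(data), variance(shifted))

# Multiplying by 3 scales the standard deviation by 3 and the variance by 9.
print(stdev(data), stdev(scaled))
print(variance(data), variance(scaled))
```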

### Standard Deviation $(\sigma)$
Mean and variance succinctly summarize a set of numbers. __Standard Deviation__ is the square root of Variance.

$$\text{Std Dev}=\sqrt{\frac{\sum (x_i-\bar x)^2}{n-1}}$$

Standard Deviation is the most common way to quantify the uncertainty of a set of outcomes. Unlike the variance, it is expressed in the same units as the data.
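
As a sketch, the standard deviation formula can be evaluated step by step and checked against `statistics.stdev`, which also uses the $n-1$ denominator. The data values are hypothetical.

```python
import math
from statistics import mean, stdev

# Hypothetical measurements, chosen only for illustration.
data = [10, 12, 23, 23, 16, 23, 21, 16]
x_bar = mean(data)

# Sample variance: sum of squared deviations divided by n - 1.
squared_deviations = [(x - x_bar) ** 2 for x in data]
sample_variance = sum(squared_deviations) / (len(data) - 1)

# Standard deviation: square root of the variance.
manual_std = math.sqrt(sample_variance)
library_std = stdev(data)

print(manual_std, library_std)
```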

![std1](https://github.com/jamwine/Statistics/blob/master/imgs/std1.PNG?raw=true)

### Percentile Rank
A __percentile rank__ is typically defined as the percentage of scores in a distribution that a specific score is greater than or equal to. For instance, if there is a score of 95 on a math test and this score is greater than or equal to the scores of 88% of the students taking the test, then the percentile rank is 88 (the 88th percentile). Alternatively, percentile rank is sometimes defined simply as the percentage of scores in a distribution that a score is strictly greater than.
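
Both definitions can be sketched with one small function. The name `percentile_rank` and its `inclusive` flag are hypothetical, introduced only for this illustration, and the test scores are made up.

```python
def percentile_rank(scores, score, inclusive=True):
    """Percentage of `scores` that `score` is >= (or strictly >) to."""
    if inclusive:
        count = sum(1 for s in scores if s <= score)   # "greater than or equal to"
    else:
        count = sum(1 for s in scores if s < score)    # "greater than"
    return 100 * count / len(scores)

# Hypothetical test scores for ten students.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

print(percentile_rank(scores, 95))                    # first definition
print(percentile_rank(scores, 95, inclusive=False))   # alternative definition
```

The two definitions differ by the share of scores equal to the given score, so it is worth stating which one is in use.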

## Frequency Distribution

The distribution of empirical data is called a __frequency distribution__ and consists of a count of the number of occurrences of each value. If the data are continuous, then a grouped frequency distribution is used. Typically, a distribution is portrayed using a frequency polygon or a histogram. Theoretical distributions, by contrast, are often defined by mathematical equations; the normal distribution is perhaps the best-known example.

A __grouped frequency distribution__ is a frequency distribution in which frequencies are displayed for ranges of data rather than for individual values. For example, the distribution of heights might be calculated by defining one-inch ranges. The frequency of individuals with various heights rounded off to the nearest inch would then be tabulated.
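
A small sketch of both kinds of frequency distribution, using `collections.Counter`. The die rolls and heights are hypothetical, and the grouping follows the one-inch example above by rounding each height to the nearest inch.

```python
from collections import Counter

# Ungrouped: count the occurrences of each individual value.
rolls = [1, 3, 3, 6, 2, 3, 6]            # hypothetical die rolls
frequency = Counter(rolls)

# Grouped: continuous heights (in inches) rounded to the nearest inch,
# so each count covers a one-inch range rather than a single value.
heights = [61.2, 62.7, 63.1, 63.4, 64.8, 64.9, 65.0, 66.3]
grouped = Counter(round(h) for h in heights)

# A crude text histogram: one '#' per observation in each range.
for inch in sorted(grouped):
    print(inch, "#" * grouped[inch])
```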

A __histogram__ and a __frequency polygon__ are the graphical representations of a distribution. They partition the variable on the x-axis into various contiguous class intervals of (usually) equal widths. The heights of the _bars_ in a histogram and the _polygon's points_ in a frequency polygon represent the class frequencies.

|![wh](https://github.com/jamwine/Statistics/blob/master/imgs/head.png?raw=true)|![wh](https://github.com/jamwine/Statistics/blob/master/imgs/head.png?raw=true)|
|-|-|
|![hist](https://github.com/jamwine/Statistics/blob/master/imgs/hist.PNG?raw=true)|![fp](https://github.com/jamwine/Statistics/blob/master/imgs/fp.PNG?raw=true)|