### Statistics

#### Statistics is, at its core, about counting and measuring.

In order to do both effectively, we have to define scales on which we can count.

One type of scale is called equal interval.

Think of the speed of a car. 5 miles per hour is 5 miles per hour, no matter what the current speed is.

The difference between 60 and 55 miles per hour will always equal the difference between 10 and 5 miles per hour in real-world terms.

Another type of scale is a logarithmic scale.

The difference between a 5 and a 6 on the Richter scale is more than the difference between a 4 and 5.

This is because each number on the Richter scale means that the earthquake had 10 times the shaking amplitude of the previous number.

So, a 6 is 10 times more powerful (technically, powerful is the wrong term, but it makes thinking about this easier) than a 5, which is 10 times more powerful than a 4. A 6 is 100 times more powerful than a 4.
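The factor-of-10 rule can be checked with a few lines of Python. This is only a sketch: the helper name `amplitude_ratio` is made up for illustration, and it assumes nothing beyond the rule stated above, that each whole step on the scale multiplies shaking amplitude by 10.

```python
def amplitude_ratio(magnitude_a, magnitude_b):
    """Return how many times larger the shaking amplitude at magnitude_a is than at magnitude_b."""
    # Each whole step on the scale multiplies the amplitude by 10
    return 10 ** (magnitude_a - magnitude_b)

ratio_6_vs_5 = amplitude_ratio(6, 5)
ratio_6_vs_4 = amplitude_ratio(6, 4)
print(ratio_6_vs_5, ratio_6_vs_4)  # prints: 10 100
```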

We can calculate the mean of values on an equal interval scale by adding up all the values and dividing by the number of values.

We could do the same for values on a non-equal interval scale, but the results wouldn't be meaningful, because of the differences between units.
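As a rough illustration, with made-up speeds and magnitudes, the sketch below averages values on both kinds of scale. The mean speed is a sensible "typical" speed, but the mean of two Richter magnitudes does not correspond to the mean of the underlying shaking amplitudes.

```python
# Hypothetical values, chosen only for illustration
speeds_mph = [5, 10, 55, 60]
mean_speed = sum(speeds_mph) / len(speeds_mph)  # 32.5

magnitudes = [4, 6]
mean_magnitude = sum(magnitudes) / len(magnitudes)  # 5.0

# Convert magnitudes to relative amplitudes, using magnitude 4 as the baseline (amplitude 1)
amplitudes = [10 ** (m - 4) for m in magnitudes]  # [1, 100]
mean_amplitude = sum(amplitudes) / len(amplitudes)  # 50.5

# A magnitude 5 quake has relative amplitude 10, far from the true mean amplitude of 50.5
amplitude_at_mean_magnitude = 10 ** (mean_magnitude - 4)  # 10.0

print(mean_speed, mean_magnitude, mean_amplitude, amplitude_at_mean_magnitude)
```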

### Ordinal Scales
So far, we've looked at equal interval and logarithmic scales, where all of the values are numbers.

But, we can also have ordinal scales, where items are ordered by rank.

For example, we could ask people how many cigarettes they smoke per day, and the answers could be "none", "a few", "some", "a lot".

These answers don't map exactly to numbers of cigarettes, but we know that "a few" is more than "none".

This is an ordinal rating scale, and we can assign numbers to the answers, in order, to make them easier to work with.

We could map 0 to "none", 1 to "a few", 2 to "some", and so on.
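A minimal sketch of that mapping is shown below. The survey answers are invented for illustration, and assigning 3 to "a lot" simply continues the "and so on" pattern.

```python
# Map each ordinal answer to its rank; the numbers preserve order, not exact cigarette counts
answer_to_rank = {"none": 0, "a few": 1, "some": 2, "a lot": 3}

# Hypothetical survey responses
survey_answers = ["a few", "none", "a lot", "some", "a few"]

ranks = [answer_to_rank[answer] for answer in survey_answers]
print(ranks)  # prints: [1, 0, 3, 2, 1]
```

Because the gaps between ranks are not known to be equal, the ranks are safe to sort and compare, but averaging them should be done with care.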

#### Skew
Now that we know how to make histograms, notice how the plots have a "shape" to them?

These shapes are important, and can show you distributional parameters of the data.

The first parameter we'll look at is called skew.
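To make the sign of skew concrete, here is a sketch on synthetic data (the arrays are generated for illustration and are not the test score data used in this lesson). A long tail to the right gives positive skew, a long tail to the left gives negative skew, and a symmetric distribution gives skew near zero.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Exponential data has a long tail to the right
right_tailed = rng.exponential(scale=10, size=5000)
# Mirroring the data flips the tail to the left
left_tailed = 100 - right_tailed
# Normal data is symmetric around its mean
symmetric = rng.normal(loc=50, scale=10, size=5000)

skew_right = skew(right_tailed)  # positive
skew_left = skew(left_tailed)    # negative, same size as skew_right
skew_sym = skew(symmetric)       # close to zero
print(skew_right, skew_left, skew_sym)
```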


In [None]:
from scipy.stats import skew
positive_skew = skew(test_scores_positive)
negative_skew = skew(test_scores_negative)
no_skew = skew(test_scores_normal)

### Kurtosis
Another parameter of a distribution is called kurtosis.

Kurtosis measures whether the distribution is short and flat, or tall and skinny.

"Shorter" distributions have a lower maximum frequency, but higher subsequent frequencies.
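The sketch below computes kurtosis for three synthetic samples (generated for illustration, not the lesson's test score data). By default, `scipy.stats.kurtosis` reports excess kurtosis, so a normal distribution scores about 0, flatter distributions score below 0, and more peaked, heavy-tailed distributions score above 0.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)

# Uniform data is short and flat (platykurtic)
flat = rng.uniform(0, 100, size=20000)
# Laplace data is tall and skinny with heavy tails (leptokurtic)
peaked = rng.laplace(loc=50, scale=10, size=20000)
# Normal data is the reference shape (mesokurtic)
normal = rng.normal(loc=50, scale=10, size=20000)

kurt_flat = kurtosis(flat)      # negative
kurt_peaked = kurtosis(peaked)  # positive
kurt_normal = kurtosis(normal)  # close to zero
print(kurt_flat, kurt_peaked, kurt_normal)
```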

In [None]:
from scipy.stats import kurtosis
kurt_platy = kurtosis(test_scores_platy)
kurt_lepto = kurtosis(test_scores_lepto)
kurt_meso = kurtosis(test_scores_meso)

### Modality
Modality is another parameter of distributions.

Modality refers to the number of modes, or peaks, in a distribution.

Real-world data is often unimodal (it has only one mode).


In [None]:
import matplotlib.pyplot as plt
# This plot has one mode, making it unimodal
plt.hist(test_scores_uni)
plt.show()

# This plot has two peaks, and is bimodal
# This could happen if one group of students learned the material, and one learned something else, for example.
plt.hist(test_scores_bi)
plt.show()

# More than one peak means that the plot is multimodal
# We can't easily measure the modality of a plot, like we can with kurtosis or skew.
# Often, the best way to detect multimodality is to observe the plot.

#### The Median
Another measure of central tendency is the median.

This is the midpoint of an array.

You have to sort the array, and then take the value in the middle.

If two values are in the middle (if there are an even number of items in the array), then you take the mean of the two middle values.
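The steps above can be sketched as a small function. The name `median` and the sample lists are chosen for illustration; in practice `statistics.median` from the standard library gives the same result.

```python
def median(values):
    """Return the midpoint of a list of numbers."""
    ordered = sorted(values)  # Sort first, so the middle position is meaningful
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of items: take the single middle value
        return ordered[middle]
    # Even number of items: take the mean of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_median = median([7, 1, 5])
even_median = median([7, 1, 5, 3])
print(odd_median, even_median)  # prints: 5 4.0
```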


##### Difference between median and mean
The median is less sensitive to very large or very small values (outliers) than the mean, so it is often a more representative center when the distribution is skewed.
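A small sketch with invented numbers shows the effect: adding one extreme value moves the mean a long way, while the median barely changes.

```python
import statistics

# Hypothetical values, chosen only for illustration
values = [30, 32, 35, 38, 40]
values_with_outlier = values + [1000]

mean_before = statistics.mean(values)      # 35
median_before = statistics.median(values)  # 35

mean_after = statistics.mean(values_with_outlier)      # about 195.83
median_after = statistics.median(values_with_outlier)  # 36.5

print(mean_before, median_before)
print(mean_after, median_after)
```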

### Variance -- how "spread out" the data is around the mean
Let's look at variance in the data.

Variance tells us how "spread out" the data is around the mean.

We looked at kurtosis earlier, which measures the shape of a distribution.

Variance measures the average squared distance between the elements in the data and the mean.

We calculate variance by subtracting the mean from every value, squaring the results, and averaging them.

Mathematically, this looks like $\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$. $\sigma^2$ is the variance, and $\sum_{i=1}^{n}$ means "the sum from 1 to n", where n is the number of elements in a vector. The formula does exactly what we just described, and is the most common way to write it.
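Before applying this to real data, the calculation can be checked on a small made-up list. The sketch below also compares the result with library functions. One thing to watch for: `numpy.var` divides by n by default, matching the formula here, while the pandas `var` method divides by n - 1 by default (the sample variance) unless `ddof=0` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen so the arithmetic is easy to follow
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)  # 5.0
squared_differences = [(x - mean) ** 2 for x in values]
manual_variance = sum(squared_differences) / len(values)  # 32 / 8 = 4.0

numpy_variance = np.var(values)                      # divides by n
pandas_variance = pd.Series(values).var()            # divides by n - 1
pandas_population = pd.Series(values).var(ddof=0)    # divides by n

print(manual_variance, numpy_variance, pandas_variance, pandas_population)
```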

The "pf" column in the data is the total number of personal fouls each player had called on them in the season -- let's look at its variance.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
# The nba data is loaded into the nba_stats variable.
# Find the mean value of the column
pf_mean = nba_stats["pf"].mean()
# Initialize variance at zero
variance = 0
# Loop through each item in the "pf" column
for p in nba_stats["pf"]:
    # Calculate the difference between the value and the mean
    difference = p - pf_mean
    # Square the difference -- this ensures that the result isn't negative
    # If we didn't square the difference, the total variance would be zero
    # ** in python means "raise whatever comes before this to the power of whatever number is after this"
    square_difference = difference ** 2
    # Add the squared difference to the total
    variance += square_difference
# Average the total to find the final variance.
variance = variance / len(nba_stats["pf"])


### Standard Deviation -- a measure of how spread out data is
A commonly used way to refer to how far data points are from the mean is called standard deviation.

It is typical to measure what percentage of the data is within one standard deviation of the mean, or within two standard deviations of the mean.

Standard deviation is a very useful concept, and is a great way to measure how spread out data is.

Luckily for us, standard deviation is just the square root of the variance.
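As a sketch on synthetic data (generated here for illustration, not taken from the NBA dataset), the code below takes the square root of the variance and then measures what share of the values fall within one and two standard deviations of the mean. For normally distributed data these shares come out near 68% and 95%; other distributions will give different percentages.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical, normally distributed data
data = rng.normal(loc=100, scale=15, size=10000)

mean = data.mean()
variance = data.var()  # Population variance (divides by n)
standard_deviation = variance ** 0.5

# Fraction of values within one and two standard deviations of the mean
within_one = np.mean(np.abs(data - mean) <= standard_deviation)
within_two = np.mean(np.abs(data - mean) <= 2 * standard_deviation)

print(standard_deviation, within_one, within_two)
```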


### Correlation
One thing that can help us a lot when we need to analyze a lot of variables is to measure correlation -- this means that we don't need to eyeball everything.

The most common way to measure correlation is to use Pearson's r, also called an r-value.

We'll go through how the calculations work later, but for now, we'll focus on the values.

An r-value ranges from -1 to 1, and indicates how strongly two variables are correlated.

A 1 means perfect positive correlation -- this would show as a straight, upward-sloping line on our plots.

A 0 means no linear correlation -- you'll typically see a scatterplot with points placed randomly.

A -1 means perfect negative correlation -- this would show as a straight, downward-sloping line.

Anything between -1 and 0, and 0 and 1 will show up as a scattering of points. The closer the value is to 0, the more random the points will look. The closer to -1 or 1, the more like a line the points will look.
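These cases can be reproduced on synthetic data. In the sketch below, all arrays are made up for illustration: two perfect lines, pure noise, and a line with noise added.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)

# Perfect upward-sloping line: r is 1
r_up, _ = pearsonr(x, 2 * x + 1)
# Perfect downward-sloping line: r is -1
r_down, _ = pearsonr(x, -3 * x + 10)
# Pure noise, unrelated to x: r is near 0
r_noise, _ = pearsonr(x, rng.normal(size=50))
# A line plus noise: r falls between 0 and 1
r_noisy_line, _ = pearsonr(x, x + rng.normal(scale=10, size=50))

print(r_up, r_down, r_noise, r_noisy_line)
```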

We can use a function from scipy to calculate Pearson's r for the moment.

In [None]:
from scipy.stats import pearsonr

# The pearsonr function will find the correlation between two columns of data.
# It returns the r value and the p value.  We'll learn more about p values later on.
r, p_value = pearsonr(nba_stats["fga"], nba_stats["pts"])
# As we can see, this is a very high positive r value -- close to 1
print(r)

### Covariance
We looked at calculating the correlation coefficient with a function, but let's briefly look under the hood to see how we can do it ourselves.

Another way to think of correlation is in terms of variance.

Two variables are correlated when they both individually vary in similar ways.

For example, two variables are positively correlated if, when one goes up, the other tends to go up as well.

This is called covariance. Covariance is how things vary together.

There is a maximum amount of how much two variables can co-vary.

This is because of how each variable is individually distributed. Each individual distribution has its own variance. These variances set a maximum theoretical limit on covariance between two variables -- you can't co-vary more from the mean than the two variables individually vary from the mean.

The r-value is a ratio between the actual covariance, and the maximum possible positive covariance.

The maximum possible covariance occurs when two variables vary together perfectly (i.e., you see a straight line on the plot).

Let's look at actual covariance first. Mathematically speaking, covariance between two variables looks like this: $cov(x,y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n}$. For each position from 1 to the length of the vectors, take the value of x and the value of y at that position. Subtract the mean of each vector from its value. Then multiply the two differences together at each position, add all of the resulting values together, and divide by n.
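The covariance calculation described above can be sketched on two short made-up vectors. Note that `numpy.cov` divides by n - 1 by default. Passing `bias=True` makes it divide by n, which matches the formula here.

```python
import numpy as np

# Hypothetical vectors, chosen so the arithmetic is easy to follow
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Subtract each vector's mean, multiply position by position, then average
products = (x - x.mean()) * (y - y.mean())
manual_covariance = products.sum() / len(x)  # 6 / 5 = 1.2

# The off-diagonal entry of the 2x2 covariance matrix is cov(x, y)
numpy_covariance_n = np.cov(x, y, bias=True)[0, 1]  # divides by n
numpy_covariance_default = np.cov(x, y)[0, 1]       # divides by n - 1

print(manual_covariance, numpy_covariance_n, numpy_covariance_default)
```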

#### correlation coefficient
Now that we know how to calculate covariance, we can calculate the correlation coefficient using the following formula:

$\frac{cov(x,y)}{\sigma_x \sigma_y}$.

For the denominator, we need to multiply the standard deviations of x and y. This product is the maximum possible positive covariance. If we divide our actual covariance by it, we get the r-value.
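As a sketch on made-up vectors, the code below divides the covariance by the product of the standard deviations and compares the result with `scipy.stats.pearsonr`. The choice between dividing by n and dividing by n - 1 cancels out in the ratio, as long as the covariance and both standard deviations use the same convention.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical vectors, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Population convention: everything divides by n
covariance_n = np.cov(x, y, bias=True)[0, 1]
r_population = covariance_n / (x.std() * y.std())

# Sample convention: everything divides by n - 1
covariance_n_minus_1 = np.cov(x, y)[0, 1]
r_sample = covariance_n_minus_1 / (x.std(ddof=1) * y.std(ddof=1))

r_scipy, _ = pearsonr(x, y)
print(r_population, r_sample, r_scipy)  # all three agree, about 0.775
```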

You can use the std method on any pandas DataFrame or Series to calculate the standard deviation. The following code returns the standard deviation for the pf column:

nba_stats["pf"].std()

You can use the cov function from NumPy to compute covariance. It returns a 2x2 covariance matrix, and the covariance between the two inputs is the off-diagonal entry at [0,1]. The following code returns the covariance between the pf and stl columns:

cov(nba_stats["pf"], nba_stats["stl"])[0,1]

In [None]:
from numpy import cov
# The nba_stats variable has already been loaded.
# r = cov(x, y) / sqrt(var(x) * var(y)), which equals cov(x, y) / (std(x) * std(y))
r_fta_blk = cov(nba_stats["fta"], nba_stats["blk"])[0, 1] / ((nba_stats["fta"].var() * nba_stats["blk"].var()) ** 0.5)
r_ast_stl = cov(nba_stats["ast"], nba_stats["stl"])[0, 1] / ((nba_stats["ast"].var() * nba_stats["stl"].var()) ** 0.5)
