# Data and Graph Types, Summary Statistics, Correlation

## Table of Contents

- [Data Types](#data)
- [Graph Types](#graphs)
    - [Visualizing a Single Variable](#single)
    - [Visualizing Multiple Variables](#multi)
    - [Visualizing Hierarchical Data](#hier)
- [Summary Statistics](#sum)
    - [Measures of Central Tendency](#sum-central)
    - [Measures of Variability](#sum-var)
    - [Correlation (coefficient)](#corr)
- [Resources](#res)

In [59]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from math import modf

---
<a id='data'></a>

# Data Types

**Statistics** is the science of learning from data. However, there are different types of variables, and they record various kinds of information. Crucially, the type of information determines what you can learn from it, and, importantly, what you cannot learn from it. Consequently, you must understand the different types of data. **Data** carries information that you are gathering for an inquiry. **Data** are evidence you can use to answer questions.

<img src="images/stat-datatypes.png" alt="" style="width: 400px;"/>


## Quantitative versus Qualitative Data

- **Quantitative**: The information is recorded as `numbers` and represents an objective measurement or a count. Temperature, weight, and a count of transactions are all quantitative data. Analysts also refer to this type as `numerical data`.

    - **Continuous** variables can take on any numeric value, and the scale can be meaningfully divided into smaller increments, including fractional and decimal values. There are an infinite number of possible values between any two values. And differences between any two values are always meaningful. Typically, you measure continuous variables on a scale. For example, when you measure height, weight, and temperature, you have continuous data.
    
        - **Intervals**: On interval scales, the interval, or distance, between any two points is meaningful. For example, the 20-degree difference between 10 and 30 degrees Celsius is equivalent to the difference between 50 and 70 degrees. However, `these scales don’t have a zero measurement that indicates the lack of the characteristic`. For example, Celsius has a zero measure, but it does not mean there is no temperature. `Datetime` values also belong to interval scales: we cannot say that Wednesday is twice as much as Monday, as we could for ratio data.
        
        Due to this lack of a true zero, `measurement ratios are not valid on interval scales`. Thirty degrees Celsius is not three times the temperature of 10 degrees Celsius. **You can add and subtract values on an interval scale, but you cannot multiply or divide them**.
        
        - **Ratios / Proportions**: On ratio scales, intervals are still meaningful. Additionally, `these scales have a zero measurement that represents a lack of the property`. For example, zero kilograms indicates a lack of weight. Consequently, measurement ratios are valid for these scales. 30 kg is three times the weight of 10 kg. **You can add, subtract, multiply, and divide values on a ratio scale**.
        
    With continuous variables, you can assess properties such as the `mean`, `median`, `distribution`, `range`, and `standard deviation`. **Continuous variables** allow you to assess a wide variety of properties.
        
    Graphical representations: `histogram`, `scatter plot`, `time series`, `box plot`, `individual value plot`.
        
    - **Discrete** quantitative data are a count of the presence of a characteristic, result, item, or activity. These counts are nonnegative integers that cannot be divided into smaller increments. For example, a single household can have 1 or 2 cars, but it cannot have 1.6. There are a finite number of possible values between any two values. Other examples of discrete variables include class sizes, number of candies in a jar, and the number of calls that a call center receives.
    
    With discrete variables, you can calculate and assess a rate of occurrence or a summary of the count, such as the `mean`, `sum`, and `standard deviation`. For example, U.S. households had an average of 2.11 vehicles in 2014.
    
    Graphical representations: `bar chart`.

- **Qualitative**: The information represents characteristics that you do `not measure with numbers`. Instead, observations fall within a countable number of groups. This type of variable can capture information that isn’t easily measured and can be subjective. Taste, eye color, architectural style, and marital status are all types of qualitative variables.

    When you record information that categorizes your observations, you are collecting **qualitative data**. There are three types of qualitative variables: **categorical**, **binary**, and **ordinal**. With these data types, you’re often interested in the proportions of each category. 
    
    Graphical representations: `bar chart`, `pie chart` (useful for displaying counts and relative percentages of each group).
    
    - **Categorical** data have values that you can put into a countable number of distinct groups based on a characteristic. For a categorical variable, you can assign categories, but the categories have `no natural order`. Categorical variables can define groups in your data that you want to compare, such as the experimental conditions of treatment and control. Analysts also refer to categorical data as **attribute** or **nominal** variables. For example, college major is a categorical variable that can have values such as psychology, political science, engineering, biology, etc.
    
    - **Binary** data can have only two values. If you can place an observation into only two categories, you have a binary variable. Statisticians also refer to binary data as both **dichotomous** data and **indicator** variables. For example, pass/fail, male/female, and the presence/absence of a characteristic are all binary data. From binary variables, analysts can calculate `proportions` or `percentages`, such as the proportion of defective products in a sample. Simply take the number of faulty products and divide by the sample size.
    
    - **Ordinal** data have `at least three categories, and the categories have a natural order`. Examples of ordinal variables include overall status (poor to excellent), agreement (strongly disagree to strongly agree), and rank (such as sporting teams).
    
        **Ordinal variables** have a `combination of qualitative and quantitative properties`. On the one hand, these variables have a limited number of discrete values like categorical variables. On the other hand, the differences between the values provide some information like quantitative variables. However, the difference between adjacent values might not be consistent. For example, first, second, and third in a race are ordinal data. The difference in time between first and second place might not be the same as the difference between second and third place.
    
        Analysts often represent **ordinal variables** using numbers, such as a 1-5 Likert scale that measures satisfaction. In number form, you can calculate `average scores` as with quantitative variables. However, `the numbers have limited usefulness because the differences between ranks might not be constant`.

In cases where you have a choice about recording a characteristic as a **continuous** or **qualitative** variable, `the best practice is to record the continuous data because you can learn so much more`.
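The interval-versus-ratio distinction described above can be checked with a few lines of arithmetic. This is a minimal sketch: the Kelvin conversion (Celsius + 273.15) is not part of the original notes and is used here only as an example of a ratio scale for temperature.

```python
# Interval scale (Celsius): differences are meaningful, ratios are not.
c_low, c_high = 10.0, 30.0

# Ratio scale (Kelvin): it has a true zero, so ratios are meaningful.
k_low, k_high = c_low + 273.15, c_high + 273.15

# The 20-degree difference is the same on both scales.
print(c_high - c_low)            # 20.0
print(round(k_high - k_low, 2))  # 20.0

# The naive Celsius ratio suggests "three times the temperature" ...
print(c_high / c_low)            # 3.0
# ... but on the Kelvin scale the ratio is only about 1.07.
print(round(k_high / k_low, 2))  # 1.07
```

The subtraction gives the same answer on both scales, while the division only makes sense on the scale with a true zero.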

### Levels of Measurement

<img src="images/stat-levelsmeasurement.png" alt="" style="width: 600px;"/>

---
<a id='graphs'></a>

# Graph Types

**Graphs** summarize a dataset visually. **Graphs** provide an intuitive view of your dataset, but there are several cautions about using them.

- Scaling can either amplify or diminish the appearance of relationships between variables.
- Making inferences about a population requires a hypothesis test.

## Histograms (1) - Distributions

**Histograms** are an excellent way `to graph continuous variables` because they show the distribution of values. Understanding the distribution allows you to determine which values are more and less common, among other properties.

Each histogram bar spans a range of values for the continuous variable on the horizontal X-axis. These ranges are also known as `bins`. The height of each bar represents either the number or proportion of observations that fall within each bin.

<img src="images/stat-histogram.png" alt="" style="width: 400px;"/>

Histograms help you determine whether the distribution is symmetric or skewed (the shape of the distribution), understand the central tendency and the spread of values, identify where the most common values fall, and spot outliers.

Use histograms when you have continuous measurements and want to understand the distribution of values and look for outliers. They are fantastic exploratory tools because they reveal properties about your sample data in ways that summary statistics cannot.
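As a rough illustration of bins and bar heights, the sketch below draws a histogram from simulated data. The heights, the seed, and the choice of 20 bins are assumptions made for the example, not values from these notes.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sample: 500 simulated heights in centimetres.
rng = np.random.default_rng(42)
heights = rng.normal(loc=170, scale=10, size=500)

fig, ax = plt.subplots()
# Each bar covers one bin; its height is the number of observations in that bin.
counts, bin_edges, _ = ax.hist(heights, bins=20, edgecolor="black")
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Number of observations")
ax.set_title("Histogram of simulated heights")

# 20 bins are delimited by 21 edges, and every observation lands in exactly one bin.
print(len(counts), len(bin_edges), int(counts.sum()))  # 20 21 500
```

Changing `bins` is worth trying: too few bins hide the shape of the distribution, while too many make it noisy.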

A `measure of central tendency` is a single value that represents the center point or typical value of a dataset, such as the mean. A `measure of variability` is another type of summary statistic that describes how spread out the values are in your dataset. The standard deviation is a conventional measure of dispersion. Histograms indicate which values occur more and less frequently along with their dispersion.

I always recommend that you graph your data before assessing the numbers. The problem with summary statistics is that they are simplifications of your dataset. Graphing the data brings it to life much more fully and intuitively. Generally, I find that using graphs in conjunction with statistics provides the best of both worlds.

## Scatter Plot - Trends

When you have `two continuous variables`, you can graph them using a **scatterplot**. Scatterplots are great for displaying the relationship between two continuous variables. Each dot on the graph has an X and Y coordinate that corresponds to a pair of values for one subject or item.

On scatterplots, statisticians have guidelines for which variable you place along each axis.
- **X-axis**: This is the horizontal axis on the chart. Typically, analysts place the `explanatory variable` on this axis. This variable explains the changes in the other variable.
- **Y-axis**: This is the vertical axis on a graph. By tradition, analysts place the `outcome variable` on this axis. The other variable explains changes in this variable. 

In cases where there isn’t a clear explanatory and outcome relationship between variables, it does not matter where you place each variable.

<img src="images/stat-scatterplot.png" alt="" style="width: 400px;"/>

**Scatterplots** display relationships between pairs of continuous variables. A relationship between a pair of variables indicates the value of one variable depends on the value of another variable. In other words, if you know the value of one variable, you can predict the value of the other variable more accurately.

## Time Series - Trends & Patterns over Time

**Time series** plots do the same, except `one of the continuous variables is time`. These plots display the continuous variable over time and allow you to determine whether the continuous variable changes over time. You can look for both trends and patterns over time. Time series plots `typically take measurements at regular intervals`, such as daily, weekly, monthly, and annually. These graphs display time on the X-axis. The Y-axis shows the continuous measurement scale.

<img src="images/stat-timeseries.png" alt="" style="width: 400px;"/>


## Bar Chart (1)

**Bar charts** are a standard way `to graph discrete variables`. Each bar represents a distinct value, and the height represents its proportion in the entire sample. Use bar charts to indicate which values occur more and less frequently.

<img src="images/stat-barchart.png" alt="" style="width: 400px;"/>

**Bar charts** and **histograms** look similar. However, the bars on a **histogram** touch while they are separate on a **bar chart**. Each bar on a **histogram** represents a range of values that continuous measurements fall within. On a **bar chart**, a bar represents one of the discrete values.
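A bar chart of a discrete variable can be sketched as follows. The household car counts are hypothetical, and plotting through `Series.plot.bar` is just one convenient option.

```python
import pandas as pd

# Hypothetical survey: number of cars per household (a discrete variable).
cars = pd.Series([0, 1, 1, 2, 2, 2, 1, 3, 2, 1, 0, 2])

# Count how often each distinct value occurs, ordered by value.
counts = cars.value_counts().sort_index()
# Convert the counts to proportions of the whole sample.
proportions = counts / counts.sum()
print(proportions)

# One separate bar per distinct value, unlike the touching bars of a histogram.
ax = proportions.plot.bar(rot=0)
ax.set_xlabel("Cars per household")
ax.set_ylabel("Proportion of households")
```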

## Pie Chart

**Pie charts** are great for highlighting the proportion of the whole that each group makes up. You can use them with `categorical, binary, and ordinal data` that define groups in your sample.

<img src="images/stat-piechart.png" alt="" style="width: 400px;"/>


## Histograms (2)

**Histograms** reveal:
- the shape of the distribution, and its central tendency, 
- the spread of values in your sample data.

You can also learn:
- how to identify outliers, 
- how histograms relate to probability distribution functions,
- why you might need to use hypothesis tests with them.

Histograms are the best method for detecting `multimodal distributions`.

You can `use histograms to contrast different groups`. Comparing properties across groups is a fundamental statistical method that can help you learn about a subject area. The primary way scientific experiments create new knowledge is by carefully setting up contrasts between groups, such as a treatment and control group.

Histograms are usually pretty good for displaying two groups, and up to four groups if you present them in separate panels. If your primary goal is to compare distributions and your histograms are challenging to interpret, consider using **boxplots** or **individual plots**. Those other plots are `better for comparing distributions when you have more groups`. But they don’t provide quite as much detail for each distribution as histograms.
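To make the group comparison concrete, here is a minimal sketch that draws two groups in separate panels. Both groups are simulated: a roughly symmetric control group and a right-skewed treatment group, and every parameter is an assumption chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical groups: a symmetric control group and a right-skewed treatment group.
control = rng.normal(loc=50, scale=5, size=300)
treatment = rng.gamma(shape=2.0, scale=5.0, size=300) + 40

# Shared bin edges make the bars directly comparable across panels.
edges = np.histogram_bin_edges(np.concatenate([control, treatment]), bins=25)

fig, axes = plt.subplots(2, 1, sharex=True, sharey=True)
for ax, values, label in zip(axes, [control, treatment], ["Control", "Treatment"]):
    ax.hist(values, bins=edges, edgecolor="black")
    ax.set_title(label)
axes[-1].set_xlabel("Measurement")
```

Sharing both the bin edges and the axes matters here. Without them, differences between the panels could come from the scaling rather than from the data.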

# TODO: add multimodal distribution, skewed... + add Python code for the examples

<img src="images/stat-piechart2.png" alt="" style="width: 400px;"/>

## Boxplots and Individual Value Plots

**Boxplots** and **individual value plots** are used to compare distributions of continuous measurements between groups. Use boxplots and individual value plots when you have a `categorical grouping variable` and a `continuous outcome variable`. The categorical variables form the groups in your data, and the researchers measure the continuous variable. These graphs display relationships between a categorical variable and a continuous variable.

Both graphs allow you to compare the distribution of values between the groups in your sample data. You can assess properties such as the center, spread, and shapes of the distributions while looking out for `outliers` (unusual values in your dataset). 

### Individual Value Plots

**Individual value plots** display the value of each observation. This graph is best when you have fewer than 50 data points per group. With a larger sample size, the data points can become packed close together, jumbled, and hard to evaluate.
- Assess the central tendency by noting the vertical position of each group’s center.
- Assess the variability by gauging the vertical range of data points within each group.

<img src="images/stat-individualvalueplot.png" alt="" style="width: 400px;"/>

### Boxplots

Like individual value plots, use **boxplots** to compare the shapes of distributions, find central tendencies, assess variability, and identify outliers. **Boxplots** are also known as **box and whisker diagrams**.

Instead of displaying the raw data points, boxplots take your sample data and present ranges of values based on quartiles and display asterisks for outliers that fall outside the whiskers. Boxplots work by breaking your data down into quarters. When your sample size is too small, the quartile estimates might not be meaningful. Consequently, these graphs work best when you have at least 20 data points per group.

**Boxplots** display what statisticians refer to as a `five-number summary`, which consists of five vital descriptive statistics for samples. These values are the `minimum`, `first quartile`, `median`, `third quartile`, and `maximum`. These five values divide your data into quarters, at least approximately, because the upper and lower whiskers do not include **outliers**, which the chart displays separately. Outliers display as asterisks beyond the upper and lower whiskers.
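The numbers behind a boxplot can be computed directly. The sketch below uses hypothetical data and assumes the common convention that points more than 1.5 × IQR beyond the box are flagged as outliers. These notes do not say which rule the pictured charts use.

```python
import numpy as np

# Hypothetical sample with one unusually large value.
values = np.array([12, 15, 17, 18, 20, 21, 22, 24, 25, 27, 60])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
inside = values[(values >= lower_fence) & (values <= upper_fence)]

print("five-number summary:", values.min(), q1, median, q3, values.max())
print("whiskers end at:", inside.min(), inside.max())  # 12 27
print("outliers:", outliers)                           # [60]
```

The maximum of the five-number summary is 60, but the upper whisker stops at 27 because 60 is drawn separately as an outlier.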

<img src="images/stat-boxplot.png" alt="" style="width: 400px;"/>

When you’re assessing a single distribution, using a **histogram** is probably better. However, for comparing multiple distributions, **boxplots** are an excellent method.

<img src="images/stat-boxplot2.png" alt="" style="width: 400px;"/>


Histograms, individual value plots, and boxplots can contrast the distributions for groups within your dataset. They illustrate relationships between a categorical and continuous variable. However, what do you use when you have a pair of categorical variables? Use **two-way contingency tables** and **bar charts**!

## Two-Way Contingency Tables

**Two-way contingency tables** represent the frequency of combinations for two categorical variables. These tables help identify relationships between a pair of categorical variables. You can also graph them using **bar charts**. Each value in a table cell indicates the number of times researchers observed a particular combination of categorical values.

<img src="images/stat-contingencytable.png" alt="" style="width: 400px;"/>

The Total column indicates the researchers surveyed 66 females and 71 males. Because we have roughly equal numbers, we can `compare the raw counts directly`. However, when you have unequal groups, `use percentages to compare` them.
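The difference between comparing raw counts and comparing percentages can be sketched with `pd.crosstab`. The responses below are hypothetical and deliberately unbalanced. They are not the data behind the table pictured above.

```python
import pandas as pd

# Hypothetical survey responses with unequal group sizes (4 vs. 6).
df = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M", "M", "M"],
    "flavor": ["vanilla", "chocolate", "chocolate", "vanilla", "vanilla",
               "chocolate", "strawberry", "vanilla", "vanilla", "strawberry"],
})

# Raw counts for each combination, with row and column totals.
counts = pd.crosstab(df["gender"], df["flavor"], margins=True, margins_name="Total")
print(counts)

# Row percentages: each row sums to 100, which makes unequal groups comparable.
row_pct = pd.crosstab(df["gender"], df["flavor"], normalize="index") * 100
print(row_pct.round(1))
```

In raw counts, vanilla looks more popular among males (3 versus 2), yet the row percentages are identical at 50% for both groups.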

# TODO: do raw values and raw percentages

## Bar Chart (2)

You can also use bar charts to display the results of a contingency table.

<img src="images/stat-barchart2.png" alt="" style="width: 400px;"/>


---
<a id='single'></a>

## Visualizing a Single Variable

<img src="images/stat-vis-type.png" alt="" style="width: 500px;"/>

---
<a id='multi'></a>

## Visualizing Multiple Variables

- **continuous** & **continuous**
    - scatter plot
    - scatter plot matrix
    
- **continuous** & **categorical**
    - box plot, violin plot (per each category)
    - faceting (side by side distribution plots per each category)
    
- **categorical** & **categorical**
    - heatmap with co-occurrence counts of one variable grouped by another variable
    
    <img src="images/stat-vis-heatmap.png" alt="" style="width: 500px;"/>

---
<a id='hier'></a>

## Visualizing Hierarchical Data

- treemap

    <img src="images/stat-vis-treemap.png" alt="" style="width: 400px;"/>

- circle packing diagram

    <img src="images/stat-vis-circle.png" alt="" style="width: 400px;"/>

- sunburst diagram

    <img src="images/stat-vis-sunburst.png" alt="" style="width: 400px;"/>


---
<a id='sum'></a>

# Summary Statistics

A **summary statistic** is a number derived from a dataset that summarizes a property of the entire dataset. 

There are four categories of summary statistics:

- **Measures of central tendency or location**, such as the mean.
- **Measures of spread or dispersion**, such as the standard deviation.
- **Measures of correlation or dependency**, such as Pearson’s correlation coefficient.
- **Measures of the shape of a distribution**, such as skewness or thickness of the tails. Summary statistics about the shape of a distribution are not used as commonly and generally not covered in basic statistics.

**Percentiles** measure the relative standing of an observation. Technically, a percentile is not a summary statistic, but it does indicate where a particular observation falls relative to the entire dataset. You can also use `percentiles as a measure of central tendency and a measure of variability`.

**Graphs** can give the wrong impression by changing the scaling of the axes. These problems are particularly prevalent when you’re comparing groups or assessing correlations between variables. **Summary statistics** help avoid these problems by providing an objective number.

## Percentiles

Percentiles are a great tool to use when you need to know the `relative standing of a value`. Where does a value fall within a distribution of values? While the concept behind percentiles is straightforward, there are different mathematical methods for calculating them. 

Percentiles tell you how a value compares to other values. The general rule is that `if value X is at the kth percentile, then X is greater than k% of the values`.

We give names to particular percentiles. The 50th percentile is the **median**. This value splits a dataset in half. Half the values are below the 50th percentile, and half are above it. The **median** is a `measure of central tendency` in statistics.

**Quartiles** are values that divide your data into quarters, and they are based on percentiles. 

- The first quartile, also known as **Q1** or the `lower quartile`, is the value of the 25th percentile. The bottom quarter of the scores fall below this value, while three-quarters fall above it.
- The second quartile, also known as **Q2** or the **median**, is the value of the 50th percentile. Half the scores are above and half below.
- The third quartile, also known as **Q3** or the `upper quartile`, is the value of the 75th percentile. The top quarter of the scores fall above this value, while three-quarters fall below it.

The **interquartile range (IQR)** is a `measure of dispersion` in statistics. This range corresponds to the distance between the first quartile and the third quartile **(IQR = Q3 – Q1)**. Larger IQRs indicate that the data are more spread out. The interquartile range represents the middle half of the data. One-quarter of the values fall below the IQR while another quarter of the values are above it.

Percentiles are surprisingly versatile because you can use them for purposes other than just obtaining a relative standing. They can also divide your dataset into portions, identify the central tendency, and measure the dispersion of a distribution.

The following three definitions, each based on the data points, define the kth percentile in different ways:

- **Definition 1**: The smallest value that is greater than k percent of the values.
- **Definition 2**: The smallest value that is greater than or equal to k percent of values.
- **Definition 3**: An interpolated value between the two closest ranks.

In [5]:
data = {1:2, 2:3, 3:6, 4:8, 5:13, 6:16, 7:22, 8:35, 9:40, 10:42, 11:48}
df = pd.DataFrame(list(data.items()), columns=['Rank', 'Value'])
df

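| | Rank | Value |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 2 | 3 |
| 2 | 3 | 6 |
| 3 | 4 | 8 |
| 4 | 5 | 13 |
| 5 | 6 | 16 |
| 6 | 7 | 22 |
| 7 | 8 | 35 |
| 8 | 9 | 40 |
| 9 | 10 | 42 |
| 10 | 11 | 48 |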


In [95]:
# Find the 65th percentile
p = 0.65

# Number of values
nvalues = df.shape[0]
nvalues

11

In [96]:
from math import ceil

# Definition 1: Greater Than
# Round p * n up to a whole count; the next ranked value is greater than that many values
perc1 = df['Value'][df['Rank'] == ceil(p*nvalues)+1]
print(perc1)
print(perc1.values[0])

8    40
Name: Value, dtype: int64
40


In [97]:
# Definition 2: Greater Than or Equal To
# The value at the rounded-up rank is greater than or equal to at least p * n values
perc2 = df['Value'][df['Rank'] == ceil(p*nvalues)]
print(perc2)
print(perc2.values[0])

7    35
Name: Value, dtype: int64
35


In [98]:
# Definition 3: Using an Interpolation Approach
rank = p * (nvalues + 1)
print(rank)

# Split the rank into its fractional and whole parts
frac, whole = modf(rank)

if rank.is_integer():
    # The rank points directly at an observation, so use its value
    perc3 = df['Value'][df['Rank'] == int(whole)].values[0]
else:
    # Find the values just below and just above the rank
    below = df['Value'][df['Rank'] == int(whole)].values[0]
    above = df['Value'][df['Rank'] == int(whole)+1].values[0]
    print(below, above)
    print(frac, whole)
    # Interpolate between the two closest observations
    perc3 = round(below + ((above - below) * frac))
    
print(perc3)

7.800000000000001
22 35
0.8000000000000007 7.0
32.0


In [109]:
f'Using three standard calculations for percentiles, we find three different values for the 65th percentile: {perc1.values[0]}, {perc2.values[0]}, {int(perc3)}'

'Using three standard calculations for percentiles, we find three different values for the 65th percentile: 40, 35, 32'
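Library functions add to the variety. As a hedged comparison, the sketch below calls `np.percentile` on the same eleven values. Its default method interpolates at rank p(n − 1) + 1 rather than p(n + 1), so it returns yet another answer. The `method="weibull"` option, available in NumPy 1.22 and later, uses the p(n + 1) rank from Definition 3 but does not round the result.

```python
import numpy as np

values = np.array([2, 3, 6, 8, 13, 16, 22, 35, 40, 42, 48])

# Default ("linear") method: interpolates at rank p * (n - 1) + 1.
linear = float(np.percentile(values, 65))
print(round(linear, 1))   # 28.5

# "weibull" method: interpolates at rank p * (n + 1), as in Definition 3.
weibull = float(np.percentile(values, 65, method="weibull"))
print(round(weibull, 1))  # 32.4
```

When you report a percentile, it helps to state which definition or library method produced it, because small datasets make the differences visible.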

---
<a id='sum-central'></a>

## Measures of Central Tendency

A `measure of central tendency` is a summary statistic that represents the center point or typical value of a dataset. These measures indicate where most values in a distribution fall. In other words, it’s the central location of a distribution. 

In statistics, the three most common measures of central tendency are the **mean**, **median**, and **mode**. Each of these measures calculates the location of the central point using a different method. `Choosing the best measure of central tendency depends on the type of data you have`. You need to know the kind of data you have, and graph it, **before choosing a measure of central tendency**!

### Mean

The mean is the arithmetic average. When to use the mean: `Symmetric distribution`, `Continuous data`

<img src="images/stat-mean.png" alt="" style="width: 400px;"/>

The calculation of the mean incorporates all values in the data. If you change any value, the mean changes. However, the mean doesn’t always locate the center of the data accurately. `In a symmetric distribution, the mean locates the center accurately`. However, in a skewed distribution, the mean can miss the mark. This problem occurs because outliers have a substantial impact on the mean. Extreme values in an extended tail pull the mean away from the center. It’s best to use the mean as a measure of the central tendency when you have a symmetric distribution.

### Median

The median is the middle value. When to use the median: `Skewed distribution`, `Continuous data`, `Ordinal data`

It is the value that splits the dataset in half. To find the median, order your data from smallest to largest, and then find the data point that has an equal number of values above it and below it. The method for locating the median varies slightly depending on whether you have an even or odd number of values.

<img src="images/stat-median.png" alt="" style="width: 600px;"/>

Outliers and skewed data have a smaller effect on the median. Unlike the mean, the median value doesn’t depend on all the values in the dataset. Consequently, when some of the values are more extreme, the effect on the median is smaller. Of course, with other types of changes, the median can change. When you have a skewed distribution, the median is a better measure of central tendency than the mean.

In a symmetric distribution, the mean and median both find the center accurately. They are approximately equal.

<img src="images/stat-mean-median1.png" alt="" style="width: 400px;"/>

In skewed distributions, outliers in the tail pull the mean from the center towards the longer tail.

<img src="images/stat-mean-median2.png" alt="" style="width: 400px;"/>
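The pull that a long tail exerts on the mean is easy to reproduce. The salaries below are hypothetical, with one deliberately extreme value.

```python
import numpy as np

# Hypothetical salaries in thousands; the last value is an extreme outlier.
salaries = np.array([30, 32, 35, 36, 38, 40, 41, 45, 300])

mean_salary = np.mean(salaries)
median_salary = np.median(salaries)

print(round(mean_salary, 1))  # 66.3, pulled toward the long right tail
print(median_salary)          # 38.0, still in the middle of the typical values
```

Only one of the nine values lies above the mean, so the mean is a poor description of a typical salary here, while the median is not.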

### Mode

The mode is the value that occurs the most frequently in your data set. When to use the mode: `Categorical data`, `Ordinal data`, `Count data`, `Probability Distributions`

On a bar chart, the mode is the highest bar. When the data have multiple values that tie for occurring most often, you have a **multimodal distribution**. If no value repeats, the data do not have a mode.

Typically, you use the **mode** with `categorical, ordinal, and discrete data`. In fact, the mode is the only measure of central tendency that you can use with categorical data - such as the most preferred flavor of ice cream. However, with categorical data, there isn’t a central value because you can’t order the groups. With ordinal and discrete data, the mode can be a value that is not in the center. Again, the mode represents the most common value.
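For categorical and discrete data, the mode can be found by counting. This small sketch uses `statistics.multimode` from the Python standard library (Python 3.8 or later) on hypothetical data. It returns every value that ties for most frequent.

```python
from statistics import multimode

# Categorical data: the mode is the most common category.
flavors = ["vanilla", "chocolate", "vanilla", "strawberry", "chocolate", "vanilla"]
print(multimode(flavors))        # ['vanilla']

# Discrete data where two values tie for most frequent.
ratings = [1, 2, 2, 3, 4, 4, 5]
print(multimode(ratings))        # [2, 4]
```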

With `continuous data`, it is unlikely that two or more values will be exactly equal because there are an infinite number of values between any two values. You can find the **mode** for continuous data by locating the maximum value on a `probability distribution plot`. If you can identify a probability distribution that fits your data, find the peak value, and use it as the mode. `Probability distribution plots are the best way to find a mode for continuous data`.

<img src="images/stat-mode.png" alt="" style="width: 400px;"/>


When you have a symmetrical, single-peaked distribution for `continuous data`, the **mean**, **median**, and **mode** are approximately equal. In this case, analysts tend to use the **mean** because it includes all of the data in the calculations. However, if you have a skewed distribution, the **median** is often the best measure of central tendency. When you have `ordinal data`, the **median** or **mode** is usually the best choice. For `categorical data`, you have to use the **mode**.

<img src="images/stat-mean-median-mode-range.png" alt="" style="width: 600px;"/>


---
<a id='sum-var'></a>

## Measures of Variability

A **measure of variability** is a summary statistic that represents the `amount of dispersion` in a dataset. How spread out are the values? While a measure of central tendency describes the typical value, measures of variability define how far away the data points tend to fall from the center. 

In statistics, `variability`, `dispersion`, and `spread` are synonyms that denote the width of the distribution. The most common measures of variability are the **range**, **interquartile range**, **variance**, and **standard deviation**.

Analysts frequently use the mean to summarize the center of a population or a process. While the mean is relevant, people often react to variability even more. When a distribution has lower variability, the values in a dataset are more consistent. However, when the variability is higher, the data points are more dissimilar and extreme values become more likely. Consequently, understanding variability helps you grasp the likelihood of unusual events.

### Range

The **range** of a dataset is the difference between the largest and smallest values in that dataset. While the range is easy to understand, it is `based on only the two most extreme values` in the dataset, which makes it `very susceptible to outliers`. If one of those numbers is unusually high or low, it affects the entire range even if it is atypical.

Additionally, `the size of the dataset affects the range`. In general, you are less likely to observe extreme values. However, as you increase the sample size, you have more opportunities to obtain extreme values. Consequently, when you draw random samples from the same population, `the range tends to increase as the sample size increases`. Accordingly, **use the range to compare variability only when the sample sizes are similar**.
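The effect of sample size on the range can be sketched with simulated data. The population parameters and seed below are assumptions. To keep the demonstration deterministic, each smaller sample is a prefix of the larger one, so the range can never shrink. With independent samples it only tends to grow.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical draws from one population; smaller samples are prefixes of larger ones.
population_draws = rng.normal(loc=100, scale=15, size=10_000)

ranges = []
for n in (10, 100, 1_000, 10_000):
    sample = population_draws[:n]
    sample_range = sample.max() - sample.min()
    ranges.append(sample_range)
    print(n, round(sample_range, 1))
```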

### The Interquartile Range (IQR)

As you learned before, the **interquartile range** is the middle half of the data, which lies between the lower and upper quartiles. In other words, the interquartile range includes the 50% of data points that fall between Q1 and Q3.

<img src="images/stat-iqr.png" alt="" style="width: 400px;"/>

`The broader the IQR, the higher the variability in your dataset`. Additionally, the **interquartile range** is a robust measure of variability like the **median** is a robust measure of central tendency. `Outliers don’t dramatically influence either measure because neither depends on every value`. The **interquartile range** is also `excellent for skewed distributions`, just like the median. When you have a normal distribution, the standard deviation tells you the percentage of observations that fall within specific distances from the mean. However, this doesn’t work for skewed distributions, and the **IQR** is an excellent alternative.

When you have a `skewed distribution`, reporting the median with the interquartile range is a particularly good combination. 

The **interquartile range** is equivalent to the region between the 75th and 25th percentile (75 – 25 = 50% of the data). You can also use other percentiles to determine the spread of different proportions. For example, the range between the 97.5th percentile and the 2.5th percentile covers 95% of the data.
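The robustness of the IQR can be illustrated by corrupting a single value and recomputing both measures. The data and the simulated data-entry error are hypothetical, and `np.percentile` is used with its default interpolation.

```python
import numpy as np

values = np.array([2, 3, 6, 8, 13, 16, 22, 35, 40, 42, 48])
with_outlier = values.copy()
with_outlier[-1] = 480  # hypothetical data-entry error in the largest value

def iqr(x):
    """Distance between the 75th and 25th percentiles."""
    q1, q3 = np.percentile(x, [25, 75])
    return q3 - q1

def value_range(x):
    """Distance between the largest and smallest values."""
    return x.max() - x.min()

print(iqr(values), iqr(with_outlier))                  # 30.5 30.5
print(value_range(values), value_range(with_outlier))  # 46 478
```

The extreme value changes the range by an order of magnitude, while the IQR is unchanged because the quartiles do not depend on the largest observation.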

### Variance

**Variance** is the average squared difference of the values from the mean. Unlike the previous measures of variability, variance `includes all values in the calculation` by comparing each value to the mean.

<img src="images/stat-variance-std.png" alt="" style="width: 300px;"/>

- μ (mu) is the population mean
- N is the population size
- x̄ (x-bar) is the sample mean
- n is the sample size
- The n − 1 in the denominator corrects for the tendency of a sample to underestimate the population variance.

Because the calculations use the squared differences, the `variance is in squared units rather than the original units of the data`. While higher values of the variance indicate greater variability, there is `no intuitive interpretation for specific values`. Despite this limitation, various statistical tests use the variance in their calculations. While it is difficult to interpret the variance itself, the **standard deviation** resolves this problem!
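
The difference between dividing by N and by n − 1 is easy to see on a few numbers. The following sketch uses made-up values and computes both versions by hand, then compares them with `np.var`, whose `ddof` argument sets how much is subtracted from the count in the denominator.

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data

mean = values.mean()                   # 5.0
squared_diffs = (values - mean) ** 2   # squared distance of each value from the mean

# Population variance: average squared difference (divide by N).
population_variance = squared_diffs.sum() / len(values)    # 32 / 8 = 4.0

# Sample variance: divide by n - 1 instead of n.
sample_variance = squared_diffs.sum() / (len(values) - 1)  # 32 / 7

print(population_variance, np.var(values))       # ddof=0 is NumPy's default
print(sample_variance, np.var(values, ddof=1))   # ddof=1 gives the n - 1 version
```

Note that NumPy defaults to the population formula (`ddof=0`), whereas pandas' `Series.var()` defaults to the sample formula (`ddof=1`), so the two libraries can return different numbers for the same data.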

### Standard Deviation

The **standard deviation** is the standard or typical difference between each data point and the mean. When the values in a dataset are grouped closer together, you have a smaller standard deviation. On the other hand, when the values are spread out more, the standard deviation is larger because the standard distance is greater.

Conveniently, the **standard deviation** `uses the original units of the data, which makes interpretation easier`. Consequently, the standard deviation is the most widely used measure of variability. The standard deviation is just the square root of the variance. Recall that the variance is in squared units.

When you have normally distributed data, or approximately so, the **standard deviation** becomes particularly valuable. You can use it to determine the proportion of the values that fall within a specified number of standard deviations from the mean.
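
For normal data, roughly 68%, 95%, and 99.7% of values fall within one, two, and three standard deviations of the mean. The sketch below checks this on simulated data; the "heights" and their mean and spread are hypothetical values chosen only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

# Hypothetical, normally distributed heights in centimetres.
heights = rng.normal(loc=170.0, scale=10.0, size=100_000)

mean = heights.mean()
variance = heights.var(ddof=1)   # sample variance, in squared centimetres
std = np.sqrt(variance)          # standard deviation, back in centimetres

def share_within(k):
    """Proportion of values within k standard deviations of the mean."""
    return np.mean(np.abs(heights - mean) <= k * std)

shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

These proportions apply only when the data are approximately normal; for skewed data the same calculation can give quite different shares.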

---
<a id='corr'></a>

## Correlation

A **correlation** between variables indicates that `as one variable changes in value, the other variable tends to change in a specific direction`. Or, you can state it as a dependency. The `value of one variable depends, to some degree, upon the value of another variable`. Correlation measures the strength of that association. Understanding that relationship is useful because `we can use the value of one variable to predict the value of the other variable`.

In statistics, **correlation coefficients** are a `quantitative assessment that measures both the direction and the strength` of this tendency to vary together. There are different types of correlation that you can use for different kinds of data. 

### Pearson’s correlation coefficient

**Pearson’s correlation coefficient** is a single number that measures both the strength and the direction of the linear relationship between a pair of continuous variables. It is represented by the Greek letter `rho (ρ)` for the population parameter and `r` for a sample statistic. Values can range from -1 to +1.

- **Strength**: The greater the absolute value of the correlation coefficient, the stronger the relationship.
    - The extreme values of `-1` and `1` indicate a perfectly linear relationship where a change in one variable is accompanied by a perfectly consistent change in the other. For these relationships, `all of the data points fall on a line`. In practice, real data almost never show either type of perfect relationship.
    - A coefficient of `zero` represents no linear relationship. As one variable increases, there is no tendency in the other variable to either increase or decrease.
    - When the value is between 0 and ±1, there is a relationship, but the points don’t all fall on a line. As r approaches -1 or 1, the strength of the relationship increases, and the data points tend to fall closer to a line.
- **Direction**: The `sign of the correlation coefficient` represents the direction of the relationship.
    - `Positive` coefficients indicate that when the value of one variable increases, the value of the other variable also tends to increase. Positive relationships produce an `upward slope on a scatterplot`.
    - `Negative` coefficients indicate that when the value of one variable increases, the value of the other variable tends to decrease. Negative relationships produce a `downward slope`.
    
Correlation Coefficient = +1: A perfect positive relationship:

<img src="images/stat-corr1.png" alt="" style="width: 400px;"/>

Correlation Coefficient = 0.8: A fairly strong positive relationship:

<img src="images/stat-corr2.png" alt="" style="width: 400px;"/>

Correlation Coefficient = 0.6: A moderate positive relationship:

<img src="images/stat-corr3.png" alt="" style="width: 400px;"/>

Correlation Coefficient = 0: No linear relationship. As one value increases, there is no tendency for the other value to change in a specific direction:

<img src="images/stat-corr4.png" alt="" style="width: 400px;"/>

Correlation Coefficient = -1: A perfect negative relationship:

<img src="images/stat-corr5.png" alt="" style="width: 400px;"/>

Correlation Coefficient = -0.8: A fairly strong negative relationship:

<img src="images/stat-corr6.png" alt="" style="width: 400px;"/>

Correlation Coefficient = -0.6: A moderate negative relationship:

<img src="images/stat-corr7.png" alt="" style="width: 400px;"/>

Weaker correlations that are closer to zero than `0.6 and -0.6` start to look like blobs of dots, and it’s hard to see the relationship. However, `there is no one-size-fits-all best answer for how strong a relationship should be`. What counts as a strong correlation depends on your study area.
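
To see where a value of r comes from, the sketch below computes the sample coefficient directly from deviations around the means and compares it with `np.corrcoef`. The data are simulated, and the coefficients `0.8` and `0.6` are assumptions picked so that the population correlation is 0.8; the helper name `pearson_r` is also just for this example.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable

# Simulated data: y is partly driven by x and partly by independent noise.
x = rng.normal(size=500)
y = 0.8 * x + 0.6 * rng.normal(size=500)

def pearson_r(a, b):
    """Sample Pearson correlation computed from deviations around the means."""
    a_dev = a - a.mean()
    b_dev = b - b.mean()
    return (a_dev * b_dev).sum() / np.sqrt((a_dev ** 2).sum() * (b_dev ** 2).sum())

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 correlation matrix

r_flipped = pearson_r(x, -y)        # reversing one variable flips the sign
r_perfect = pearson_r(x, 2 * x + 1) # points exactly on an upward line

print(f"manual r = {r_manual:.3f}, np.corrcoef r = {r_numpy:.3f}")
print(f"flipped r = {r_flipped:.3f}, perfect r = {r_perfect:.3f}")
```

With a pandas DataFrame, `df.corr()` returns the same Pearson coefficients for every pair of numeric columns.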

**Pearson’s correlation coefficients** measure `only linear relationships`. Consequently, if your data contain a curvilinear relationship, the correlation coefficient can badly understate it or miss it entirely.
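
A quick way to see this limitation is to build a relationship that is perfectly predictable but not linear. In the sketch below (an illustration with made-up data), `y` is exactly `x` squared, yet Pearson's r comes out at essentially zero because the upward and downward halves of the curve cancel out.

```python
import numpy as np

# x is symmetric around zero; y depends on x exactly, but along a curve.
x = np.linspace(-3, 3, 61)
y = x ** 2

r_curve = np.corrcoef(x, y)[0, 1]

# For comparison: restrict to x >= 0, where the curve only goes up.
mask = x >= 0
r_half = np.corrcoef(x[mask], y[mask])[0, 1]

print(f"r over the full curve:  {r_curve:.3f}")
print(f"r over the rising half: {r_half:.3f}")
```

This is one reason to look at a scatterplot before trusting a correlation coefficient close to zero.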

## Correlation Does Not Imply Causation

**Correlation** between two variables indicates that changes in one variable are associated with changes in the other variable. However, `correlation does not mean that changes in one variable actually cause changes in the other variable`. Instead, it’s merely an **association**. As the value of variable A increases, the value of variable B also increases. However, if it is `correlation but not causation`, then the changes in variable A are not causing the changes in variable B.

How does it come to be that variables are correlated but do not have a causal relationship? A common reason is a **spurious correlation**. A **spurious correlation** is a situation where two variables appear to have an association, but in reality, a third factor causes that association. That third factor is known as a **confounding variable**. A **confounding variable** correlates with both of your variables of interest and can create confusion about which relationships are causal and which are merely spurious associations.
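
A confounding variable can be imitated in a simulation. In the sketch below, all names and numbers are hypothetical: `temperature` drives both `ice_cream_sales` and `sunburn_cases`, which have no direct effect on each other. Regressing the confounder out of each variable and correlating the leftovers is one simple way to adjust for it, chosen here for illustration rather than as the only approach.

```python
import numpy as np

rng = np.random.default_rng(7)  # fixed seed so the sketch is repeatable
n = 5000

# Hypothetical confounder and two variables that each depend only on it.
temperature = rng.normal(size=n)
ice_cream_sales = temperature + rng.normal(size=n)
sunburn_cases = temperature + rng.normal(size=n)

# The raw correlation looks like a real association.
r_raw = np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1]

def residuals(target, predictor):
    """Part of target left over after a straight-line fit on predictor."""
    slope, intercept = np.polyfit(predictor, target, deg=1)
    return target - (slope * predictor + intercept)

# Correlate what remains once temperature is accounted for.
r_adjusted = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(sunburn_cases, temperature),
)[0, 1]

print(f"raw correlation:                 {r_raw:.3f}")
print(f"after adjusting for temperature: {r_adjusted:.3f}")
```

The raw correlation should be clearly positive, while the adjusted one should sit close to zero, showing that the association came entirely from the shared third factor.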

In statistics, you typically need to `perform a randomized, controlled experiment to determine that a relationship is causal rather than merely a correlation`.

---
<a id='res'></a>

# Resources

- [Statistics by Jim](https://statisticsbyjim.com/)
- [onlinemathlearning.com](https://www.onlinemathlearning.com)
- [Programming Skills for Data Science](https://programming-for-data-science.github.io/)