# Estimates of Location

Estimates of location try to identify where most of the data is located (i.e., its central tendency).

- **Mean:** The sum of all values divided by the number of values
    - Synonyms: Average
- **Weighted Mean:** The sum of each value multiplied by its weight, divided by the sum of the weights
    - Synonyms: Weighted Average
- **Median**: The value such that one-half of the data lies above it and one-half below
    - Synonyms: 50th Percentile
- **Weighted Median:** The value in the sorted data such that one-half of the sum of the weights lies above it and one-half below
- **Trimmed Mean:** The average of all values after dropping a fixed number of extreme values
    - Synonyms: Truncated Mean
- **Robust**:  Not sensitive to extreme values
- **Outlier**: A data value that is very different from most of the data
    - Synonyms: Extreme Value
    
## Mean
\begin{equation}
\text{Mean} = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
\end{equation}

## Trimmed Mean
Calculated by sorting the values, removing a fixed number of values at each end, and averaging the remaining values. In the formula below, $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ are the sorted values and $p$ is the number of smallest values and the number of largest values removed.

\begin{equation}
\text{Trimmed Mean} = \bar{x} = \frac{\sum_{i = p + 1}^{n - p} x_{(i)}}{n - 2p}
\end{equation}

A trimmed mean reduces the influence of extreme values.
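
As a minimal sketch of the two formulas above, the following compares the plain mean with a trimmed mean. The function names and the sample data are illustrative assumptions, not part of the original notes.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def trimmed_mean(values, p):
    """Mean after dropping the p smallest and the p largest values."""
    n = len(values)
    if p < 0 or 2 * p >= n:
        raise ValueError("p must satisfy 0 <= p < n / 2")
    kept = sorted(values)[p:n - p]  # sorted values p+1 .. n-p (1-based)
    return sum(kept) / len(kept)


# Hypothetical sample containing one extreme value (100).
data = [2, 3, 4, 5, 100]

plain = mean(data)               # (2 + 3 + 4 + 5 + 100) / 5
trimmed = trimmed_mean(data, 1)  # (3 + 4 + 5) / 3
print(plain, trimmed)
```

With `p = 1`, the single extreme value no longer pulls the estimate upward, which is the effect the trimmed mean is designed to have.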

## Weighted Mean

Calculated by multiplying each value $x_i$ by a weight $w_i$ and dividing the sum of these products by the sum of the weights.

\begin{equation}
\text{Weighted Mean} = \bar{x}_w = \frac{\sum_{i=1}^{n} x_i w_i}{\sum_{i=1}^{n} w_i}
\end{equation}

Use cases:
- Some values are intrinsically more variable than others, and highly variable observations are given lower weight
- The data collected does not equally represent the different groups that we are interested in measuring, e.g. we can give higher weights to the group that is underrepresented
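
The weighted mean formula can be sketched in a few lines of Python. The function name, the group averages, and the group sizes below are made up for illustration.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight


# Hypothetical per-group averages and the number of records in each group.
group_values = [2.0, 4.0, 10.0]
group_sizes = [50, 30, 20]

unweighted = sum(group_values) / len(group_values)
weighted = weighted_mean(group_values, group_sizes)
print(unweighted, weighted)
```

Here the largest group has the smallest value, so the weighted mean comes out lower than the unweighted mean of the three group values.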

## Outliers

The median is considered a robust estimate of location because it is not influenced by outliers. An outlier is any value that is very distant from the other values in a data set.
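
A small sketch of this robustness, using Python's standard `statistics` module on made-up data: adding one very distant value moves the mean a long way but barely moves the median.

```python
import statistics

# Hypothetical sample, then the same sample with one outlier appended.
data = [2, 3, 4, 5, 6]
with_outlier = data + [1000]

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)

median_before = statistics.median(data)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```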

## Example
- To calculate the average murder rate for the country, you would need to use a weighted mean or weighted median, weighted by state population, to account for the different populations of the states.
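
The sketch below illustrates the idea with invented rates and populations (not real state data). The `weighted_median` helper is an assumption: it uses one common convention, returning the first sorted value at which the cumulative weight reaches half of the total weight. Other conventions interpolate between neighbouring values.

```python
def weighted_median(values, weights):
    """First sorted value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("values must not be empty")


# Hypothetical rates for four regions and their populations (in millions).
rates = [1.0, 2.0, 4.0, 8.0]
populations = [2, 6, 1, 1]

simple_mean = sum(rates) / len(rates)
pop_weighted_mean = sum(r * p for r, p in zip(rates, populations)) / sum(populations)
pop_weighted_median = weighted_median(rates, populations)

print(simple_mean, pop_weighted_mean, pop_weighted_median)
```

Because most of the population lives in the low-rate regions, both weighted estimates fall below the simple mean of the four rates.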

## Takeaways
- The basic metric for location is the mean, but it can be sensitive to extreme values
- Other metrics, such as the median and the trimmed mean, are more robust

# Estimates of Variability

Estimating location is just one way of summarizing data. Another is to measure variability: how tightly clustered or spread out the data values are.

- **Deviations:** The difference between the observed values and the estimate of location.
    - Synonyms: errors, residuals
- **Variance:** The sum of squared deviations from the mean divided by n – 1 where n is the number of data values.
    - Synonyms: mean-squared-error
- **Standard deviation:** The square root of the variance.
    - Synonyms: l2-norm, Euclidean norm
- **Mean absolute deviation:** The mean of the absolute value of the deviations from the mean.
    - Synonyms: l1-norm, Manhattan norm
- **Median absolute deviation from the median:** The median of the absolute value of the deviations from the median.
- **Range:** The difference between the largest and the smallest value in a data set.
- **Order statistics:** Metrics based on the data values sorted from smallest to biggest.
    - Synonyms: ranks
- **Percentile:** The value such that P percent of the values take on this value or less and (100–P) percent take on this value or more.
    - Synonyms: quantile
- **Interquartile range:** The difference between the 75th percentile and the 25th percentile.
    - Synonyms: IQR
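
The definitions above can be sketched with Python's standard `statistics` module. The sample is made up, and the quartiles use the module's `"inclusive"` method, which is only one of several percentile conventions; other conventions can give a slightly different IQR.

```python
import statistics

# Hypothetical sample.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

mean = statistics.mean(data)
median = statistics.median(data)

variance = statistics.variance(data)  # sample variance: divides by n - 1
std_dev = statistics.stdev(data)      # square root of the sample variance

# Mean absolute deviation from the mean.
mean_abs_dev = sum(abs(x - mean) for x in data) / len(data)

# Median absolute deviation from the median.
median_abs_dev = statistics.median(abs(x - median) for x in data)

value_range = max(data) - min(data)

# Quartiles: 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(variance, std_dev, mean_abs_dev, median_abs_dev, value_range, iqr)
```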
    
## Statistical Moments

In statistical theory, location and variability are referred to as the first and second moments of a distribution. The third and fourth moments are called skewness and kurtosis. Skewness refers to whether the data is skewed toward larger or smaller values, and kurtosis indicates the propensity of the data to have extreme values. Generally, metrics are not used to measure skewness and kurtosis; instead, these are discovered through visual displays.

## Exploring Data Distribution
- **Boxplot**: A plot introduced by Tukey as a quick way to visualize the distribution of data.
- **Frequency table**: A tally of the count of numeric data values that fall into a set of intervals (bins).
- **Histogram**: A plot of the frequency table with the bins on the x-axis and the count (or proportion) on the y-axis.
- **Density plot**: A smoothed version of the histogram, often based on a kernel density estimate.
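
A frequency table can be sketched without any plotting library. The helper below is an illustrative assumption: it uses equal-width bins and places the maximum value in the last bin. The text bars at the end are a rough stand-in for a histogram.

```python
def frequency_table(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("values must not all be equal")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is assigned to the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical sample.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = frequency_table(data, 4)

# One text bar per bin: bin edges on the left, count shown as '#' marks.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:4.1f} to {right:4.1f} | {'#' * count}")
```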

# Exploring Binary and Categorical Data

- **Mode**: The most commonly occurring category or value in a data set.
- **Expected value**: When the categories can be associated with a numeric value, this gives an average value based on a category’s probability of occurrence.
- **Bar charts**: The frequency or proportion for each category plotted as bars.
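
As a rough sketch, the mode and the per-category proportions can be computed with `collections.Counter`. The category labels are hypothetical, and the printed text bars stand in for a bar chart.

```python
from collections import Counter

# Hypothetical categorical data.
responses = ["free", "basic", "free", "premium", "free", "basic"]

counts = Counter(responses)

# Mode: the most commonly occurring category.
mode, mode_count = counts.most_common(1)[0]

# Proportion of records in each category.
proportions = {category: n / len(responses) for category, n in counts.items()}

# Text bar chart: one bar per category, sized by frequency.
for category, n in counts.most_common():
    print(f"{category:8s} {'#' * n}")
```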

## Expected Value
A special type of categorical data is data in which the categories represent or can
be mapped to discrete values on the same scale. A marketer for a new cloud
technology, for example, offers two levels of service, one priced at £300/month
and another at £50/month. The marketer offers free webinars to generate leads,
and the firm figures that 5% of the attendees will sign up for the £300 service,
15% for the £50 service, and 80% will not sign up for anything. This data can be
summed up, for financial purposes, in a single “expected value,” which is a form
of weighted mean in which the weights are probabilities.

The expected value is calculated as follows:

1. Multiply each outcome by its probability of occurring.
2. Sum these values.

In the cloud service example, the expected value of a webinar attendee is thus
£22.50 per month, calculated as follows:

$EV = (0.05)(300) + (0.15)(50)+(0.80)(0)=22.5$
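
The same arithmetic can be checked in Python. This is a minimal sketch using the prices and sign-up probabilities from the example; the variable names are illustrative.

```python
# Monthly revenue per outcome (in pounds) and the probability of each outcome.
outcomes = [300, 50, 0]
probabilities = [0.05, 0.15, 0.80]

# The probabilities should cover every outcome, so they must sum to 1.
assert abs(sum(probabilities) - 1.0) < 1e-9

# Step 1: multiply each outcome by its probability. Step 2: sum the products.
expected_value = sum(p * x for p, x in zip(probabilities, outcomes))

print(round(expected_value, 2))  # approximately 22.5 pounds per attendee per month
```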

Expected value extends the weighted mean with the ideas of future
expectations and probability weights, often based on subjective judgment.
Expected value is a fundamental concept in business valuation and capital
budgeting—for example, the expected value of five years of profits from a new
acquisition, or the expected cost savings from new patient management software
at a clinic.

## Correlation

- **Correlation coefficient**: A metric that measures the extent to which numeric variables are associated with one another (ranges from –1 to +1).
- **Correlation matrix**: A table where the variables are shown on both rows and columns, and the cell values are the correlations between the variables.
- **Scatterplot**: A plot in which the x-axis is the value of one variable, and the y-axis the value of another.

### Correlation Key Notes
- The correlation coefficient measures the extent to which two variables are associated with one another.
- When high values of v1 go with high values of v2, v1 and v2 are positively associated.
- When high values of v1 are associated with low values of v2, v1 and v2 are negatively associated.
- The correlation coefficient is a standardized metric so that it always ranges from –1 (perfect negative correlation) to +1 (perfect positive correlation).
- A correlation coefficient of 0 indicates no correlation, but be aware that random arrangements of data will produce both positive and negative values for the correlation coefficient just by chance.
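
The notes do not name a specific coefficient. The sketch below assumes Pearson's correlation coefficient, the most common choice, and uses made-up variables `v1`, `v2` and `v3`.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical variables.
v1 = [1, 2, 3, 4, 5]
v2 = [2, 4, 6, 8, 10]   # high values of v1 go with high values of v2
v3 = [10, 8, 6, 4, 2]   # high values of v1 go with low values of v3

r_positive = pearson_r(v1, v2)
r_negative = pearson_r(v1, v3)
print(r_positive, r_negative)
```

Dividing by the square root of the two sums of squared deviations is what standardizes the metric to the range from -1 to +1.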

## Exploring Two or More Variables
- **Contingency tables**: A tally of counts between two or more categorical variables.
- **Hexagonal binning**: A plot of two numeric variables with the records binned into hexagons. This plot is useful when you have a large number of data points in a scatter plot. Instead of visualising a large "blob" we could bin into hexagons with a colour indicating the number of records in that bin.
- **Contour plots**: A plot showing the density of two numeric variables like a topographical map.
- **Violin plots**: Similar to a boxplot but showing the density estimate.

### Exploring Two or More Variables Key Notes
- Hexagonal binning and contour plots are useful tools that permit graphical examination of two numeric variables at a time, without being overwhelmed by huge amounts of data.
- Contingency tables are the standard tool for looking at the counts of two categorical variables.
- Boxplots and violin plots allow you to plot a numeric variable against a categorical variable.
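
A contingency table is a tally of counts for each combination of categories, which can be sketched with `collections.Counter`. The two categorical variables below (a grade and a payment status) and their values are hypothetical.

```python
from collections import Counter

# Hypothetical records: (grade, status) pairs.
records = [
    ("A", "paid"), ("A", "late"), ("B", "paid"), ("A", "paid"),
    ("B", "late"), ("B", "paid"), ("C", "paid"),
]

# Tally each (grade, status) combination; missing combinations count as 0.
table = Counter(records)

grades = sorted({grade for grade, _ in records})
statuses = sorted({status for _, status in records})

# Print the table with grades as rows and statuses as columns.
print("grade " + " ".join(f"{s:>5s}" for s in statuses))
for grade in grades:
    row = " ".join(f"{table[(grade, s)]:5d}" for s in statuses)
    print(f"{grade:5s} {row}")
```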