## Basic Statistics

**Table of Contents:**
1. Variance
2. Standard Deviation
3. Covariance
4. Correlation

In [1]:
# We will use NumPy for this exercise
import numpy as np

### 1. Variance

The variance of a dataset measures how spread out the values are around their mean. The sample variance is calculated using the following formula:

$$
\text{Sample Variance} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2
$$

Where:
- $N$ is the number of data points
- $x_i$ is each individual data point
- $\bar{x}$ is the mean of the data points

In [2]:
# Define a sample data set
X = np.array([1, 2, 3, 4, 5])

# Calculate sample variance
variance_manual = np.sum((X - X.mean())**2) / (X.size - 1)
variance_numpy = np.var(X, ddof=1) # ddof=1 is used to calculate the sample variance (default is ddof=0 which calculates the population variance)

# They should be the same
print(f"Sample variance (NumPy): {variance_numpy}")
print(f"Sample variance (Manual): {variance_manual}")

Sample variance (NumPy): 2.5
Sample variance (Manual): 2.5
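
The comment on `ddof` above can be checked directly. The following is a minimal sketch, reusing the same data, that compares NumPy's default (`ddof=0`, dividing by $N$) with the sample version (`ddof=1`, dividing by $N-1$). The variable names `population_variance` and `sample_variance` are introduced here for illustration only.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])

# ddof=0 (NumPy's default) divides the sum of squared deviations by N
population_variance = np.var(X)

# ddof=1 divides by N - 1, matching the sample variance formula above
sample_variance = np.var(X, ddof=1)

print(f"Population variance (ddof=0): {population_variance}")
print(f"Sample variance (ddof=1): {sample_variance}")
```

The two values differ by a factor of $N / (N-1)$, so the gap shrinks as the sample grows.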


### 2. Standard Deviation

The sample standard deviation also measures how spread out the values are. Because variance squares the differences from the mean, it is expressed in the squared units of the original data. The standard deviation is the square root of the variance, which returns the measure of spread to the original data's scale and units. It is calculated using the following formula:

$$
\text{Sample Standard Deviation} = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2} = \sqrt{\text{Sample Variance}}
$$

Where:
- $N$ is the number of data points
- $x_i$ is each individual data point
- $\bar{x}$ is the mean of the data points

In [3]:
# Calculate sample standard deviation
std_manual = np.sqrt(variance_manual)
std_numpy = np.std(X, ddof=1) # ddof=1 is used to calculate the sample std (default is ddof=0 which calculates the population std)

# They should be the same
print(f"Sample standard deviation (NumPy): {std_numpy:.2f}")
print(f"Sample standard deviation (Manual): {std_manual:.2f}")

Sample standard deviation (NumPy): 1.58
Sample standard deviation (Manual): 1.58
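
To illustrate the point about units, here is a small sketch that assumes, purely for illustration, that the data are lengths in metres and converts them to centimetres. The names `X_m` and `X_cm` are hypothetical. The variance grows by the square of the conversion factor, while the standard deviation grows by the factor itself, as a quantity in the original units should.

```python
import numpy as np

# Hypothetical example: the same measurements in metres and in centimetres
X_m = np.array([1, 2, 3, 4, 5])
X_cm = X_m * 100

# Variance is in squared units, so it scales by 100**2
var_ratio = np.var(X_cm, ddof=1) / np.var(X_m, ddof=1)

# Standard deviation is in the original units, so it scales by 100
std_ratio = np.std(X_cm, ddof=1) / np.std(X_m, ddof=1)

print(f"Variance ratio (cm^2 / m^2): {var_ratio:.1f}")
print(f"Standard deviation ratio (cm / m): {std_ratio:.1f}")
```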


### 3. Covariance

Covariance is a measure of how much two random variables vary together. The covariance between two variables $X$ and $Y$ is calculated using the following formula:

$$
\text{Cov}(X, Y) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})
$$

Where:
- $N$ is the number of data points
- $x_i$ and $y_i$ are the individual data points of $X$ and $Y$
- $\bar{x}$ and $\bar{y}$ are the means of $X$ and $Y$ respectively

In [4]:
# Define another sample to calculate covariance and correlation
Y = np.array([5, 4, 3, 2, 1])

# Calculate sample covariance
cov_manual = np.sum((X - X.mean())*(Y - Y.mean())) / (X.size - 1)
cov_numpy = np.cov(X, Y, ddof=1)[0, 1]

# They should be the same
print(f"Sample covariance (NumPy): {cov_numpy:.2f}")
print(f"Sample covariance (Manual): {cov_manual:.2f}")

Sample covariance (NumPy): -2.50
Sample covariance (Manual): -2.50
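
The `[0, 1]` index used above is needed because `np.cov` returns a full covariance matrix rather than a single number. The sketch below prints that matrix for the same `X` and `Y`. The diagonal entries are the sample variances of each variable, and the off-diagonal entries are the covariance between them.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([5, 4, 3, 2, 1])

# np.cov treats each input as one variable and returns a 2x2 matrix:
# [[Var(X),    Cov(X, Y)],
#  [Cov(Y, X), Var(Y)   ]]
cov_matrix = np.cov(X, Y, ddof=1)
print(cov_matrix)

# The diagonal matches the sample variances computed with np.var
print(f"Var(X): {cov_matrix[0, 0]}, np.var(X, ddof=1): {np.var(X, ddof=1)}")
print(f"Var(Y): {cov_matrix[1, 1]}, np.var(Y, ddof=1): {np.var(Y, ddof=1)}")
```

The matrix is symmetric, so `cov_matrix[0, 1]` and `cov_matrix[1, 0]` hold the same value.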


### 4. Correlation

Correlation is a measure of the strength and direction of the linear relationship between two variables. It is a standardized version of covariance and is calculated using the following formula:

$$
\text{Correlation}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}
$$

Where:
- $\text{Cov}(X, Y)$ is the covariance between $X$ and $Y$
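- $\sigma_X$ is the standard deviation of $X$
- $\sigma_Y$ is the standard deviation of $Y$

The covariance and the standard deviations must use the same normalization (here the sample versions, with $N-1$ in the denominator) so that the factor cancels.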

In [5]:
# Show samples
print(f"X: {X}")
print(f"Y: {Y}")

# Calculate sample correlation
corr_manual = cov_manual / (std_manual * np.std(Y, ddof=1))
corr_numpy = np.corrcoef(X, Y)[0, 1]

# They should be the same
print(f"Sample correlation (NumPy): {corr_numpy:.2f}")
print(f"Sample correlation (Manual): {corr_manual:.2f}")

X: [1 2 3 4 5]
Y: [5 4 3 2 1]
Sample correlation (NumPy): -1.00
Sample correlation (Manual): -1.00


The correlation coefficient ranges from -1 to 1:
- A correlation of 1 indicates a perfect positive linear relationship.
- A correlation of -1 indicates a perfect negative linear relationship.
- A correlation of 0 indicates no linear relationship.
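
Note that "no linear relationship" is not the same as "no relationship". As a minimal sketch, with illustrative arrays `x_sym` and `y_sq` that are not part of the exercise data, the example below builds a variable that is completely determined by another (its square) and yet has a correlation of zero with it, because the dependence is symmetric rather than linear.

```python
import numpy as np

# Illustrative data: y_sq is fully determined by x_sym, but not linearly
x_sym = np.array([-2, -1, 0, 1, 2])
y_sq = x_sym ** 2

corr_nonlinear = np.corrcoef(x_sym, y_sq)[0, 1]

# abs() avoids printing "-0.00" if the result is a negative zero
print(f"Correlation between x and x^2: {abs(corr_nonlinear):.2f}")
```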

In [6]:
# Example of a perfect positive correlation
print(f"Perfect Positive Correlation: {np.corrcoef(X, X)[0, 1]:.1f}")

# Example of a perfect negative correlation
print(f"Perfect Negative Correlation: {np.corrcoef(X, -X)[0, 1]:.1f}")

# Example of zero correlation
Z = np.array([1, 2, 1, 2, 1])
print(f"Zero Correlation: {np.corrcoef(X, Z)[0, 1]:.1f}")

Perfect Positive Correlation: 1.0
Perfect Negative Correlation: -1.0
Zero Correlation: 0.0


**Why Correlation is Preferred Over Covariance**

While the sign of the covariance indicates the direction of the linear relationship between variables, its magnitude does not directly indicate the strength of that relationship, because it depends on the scale (units) of the variables. This makes it difficult to compare covariances across different datasets.

Correlation, on the other hand, is a dimensionless measure that standardizes the covariance by dividing it by the product of the standard deviations of the variables. This standardization makes correlation a more interpretable and comparable measure of the linear relationship between variables.
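
A short derivation makes the effect of this standardization explicit. Assume $X$ and $Y$ are rescaled by nonzero constants $a$ and $b$ (symbols introduced here for illustration). The covariance picks up the factor $ab$, and each standard deviation picks up the absolute value of its own factor:

$$
\text{Cov}(aX, bY) = ab \, \text{Cov}(X, Y), \qquad \sigma_{aX} = |a| \, \sigma_X, \qquad \sigma_{bY} = |b| \, \sigma_Y
$$

Dividing one by the other, the scale factors cancel up to a sign:

$$
\text{Correlation}(aX, bY) = \frac{ab \, \text{Cov}(X, Y)}{|a| \, \sigma_X \, |b| \, \sigma_Y} = \frac{ab}{|ab|} \, \text{Correlation}(X, Y)
$$

So when $a$ and $b$ have the same sign, the correlation is unchanged, while the covariance is multiplied by $ab$.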

In [7]:
# Sample data
X_scaled = X * 1000
Y_scaled = Y * 1000

# Show samples
print(f"X scaled: {X_scaled}")
print(f"Y scaled: {Y_scaled}")

# Calculate the covariance and correlation
cov_scaled = np.cov(X_scaled, Y_scaled, ddof=1)[0, 1]
corr_scaled = np.corrcoef(X_scaled, Y_scaled)[0, 1]

# Covariance changes significantly when changing the scale of the data, while correlation remains the same
print(f"Covariance Scaled: {cov_scaled}")
print(f"Correlation Scaled: {corr_scaled:.1f}")

X scaled: [1000 2000 3000 4000 5000]
Y scaled: [5000 4000 3000 2000 1000]
Covariance Scaled: -2500000.0
Correlation Scaled: -1.0
