# Statistics_For_Data_Analysis_Part3

# Measures of dispersion: (spread or variability)
#### Range, Variance, Standard Deviation, Percentiles, and Quartiles

Below are examples and explanations for each measure of dispersion:

1. Range:
   - Example: Consider the following set of numbers: 10, 15, 20, 25, 30.
   - To find the range, subtract the smallest number from the largest number.
   - Range = 30 - 10 = 20.

2. Interquartile Range (IQR):
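   - Example: Consider the following set of numbers: 5, 7, 9, 12, 15, 18, 22, 25, 30.
   - First, arrange the numbers in ascending order (this set is already sorted).
   - Calculate the first quartile (Q1) and third quartile (Q3). Taking them as the medians of the lower half (5, 7, 9, 12) and the upper half (18, 22, 25, 30) gives Q1 = 8 and Q3 = 23.5.
   - IQR = Q3 - Q1 = 23.5 - 8 = 15.5. (Interpolation-based methods, such as NumPy's default, can give slightly different quartiles.)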

3. Quartile:
   - Example: Consider the same set of numbers: 5, 7, 9, 12, 15, 18, 22, 25, 30.
   - Quartiles divide a sorted dataset into four parts, each holding roughly 25% of the values.
   - Q1 is the median of the lower half of the dataset, Q2 is the median of the entire dataset (here, 15), and Q3 is the median of the upper half of the dataset.

4. Percentile:
   - Example: Consider the same set of numbers: 5, 7, 9, 12, 15, 18, 22, 25, 30.
   - Percentiles divide a sorted dataset into 100 equal parts.
   - The nth percentile is the value below which n% of the data falls. For this set, the 50th percentile is the median, 15.

5. Variance:
   - Example: Consider the set from the range example: 10, 15, 20, 25, 30.
   - To find the variance, first calculate the mean of the dataset (20). Then subtract the mean from each number, square the results (100, 25, 0, 25, 100), and average them: 250 / 5 = 50 for the population variance. For the sample variance, divide by n - 1 instead: 250 / 4 = 62.5.

6. Standard Deviation:
   - Example: Consider the same set of numbers: 10, 15, 20, 25, 30.
   - To find the standard deviation, take the square root of the variance. With a population variance of 50, the standard deviation is about 7.07.

These measures of dispersion help to understand the spread or variability of a dataset.

# 1. Range = (Max - Min)

The range is simple to compute, but it has some limitations:

1. **Sensitive to Extreme Values**: The range can be greatly influenced by extreme values or outliers. Because it considers only the maximum and minimum, a single outlier can inflate the range and distort the picture of the data's dispersion.

2. **Does Not Consider the Distribution**: The range says nothing about how the values are distributed between the minimum and the maximum. Two datasets with very different shapes can share the same range.

3. **Suitable for Small Datasets**: The range works best as a quick summary for small datasets. In larger datasets an extreme value is more likely to appear, which makes the range less informative as a measure of spread.
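
To illustrate the sensitivity to extreme values, the sketch below appends a single made-up outlier (300, chosen only for illustration and not part of the original data) to the example list and recomputes the range.

```python
# Example list of numbers, with and without a made-up outlier
numbers = [10, 15, 20, 25, 30]
numbers_with_outlier = numbers + [300]  # 300 is an illustrative value

# The range only looks at the two extreme values
range_val = max(numbers) - min(numbers)
range_with_outlier = max(numbers_with_outlier) - min(numbers_with_outlier)

print("Range without outlier:", range_val)
print("Range with outlier:", range_with_outlier)
```

One added value changes the range from 20 to 290, even though the other five values are unchanged.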

## Range Examples:

In [1]:
# Example list of numbers
numbers = [10, 15, 20, 25, 30]

# Range
range_val = max(numbers) - min(numbers)

print("Range:", range_val)


Range: 20


# 2. Interquartile Range (IQR):

The interquartile range (IQR) is a measure of statistical dispersion, specifically a measure of the spread of the middle 50% of a dataset. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1), which represents the range within which the middle 50% of the data values lie.

Since the IQR focuses on the middle portion of the data distribution, it is less sensitive to extreme values or outliers than the range or the standard deviation. This makes it a robust measure of spread, particularly in datasets where extreme values may skew the results.

By using the IQR, analysts can better understand the variability of the central part of the dataset while minimizing the influence of extreme observations. It provides valuable information about the dispersion of the data without being unduly influenced by outliers.
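
As a rough illustration of this robustness, the sketch below compares the IQR and the standard deviation on the example list and on a copy whose largest value is replaced by a made-up outlier. The helper name `iqr_and_std` and the value 300 are assumptions introduced here for the example.

```python
import numpy as np

clean = [10, 15, 20, 25, 30]
skewed = [10, 15, 20, 25, 300]  # largest value replaced by a made-up outlier

def iqr_and_std(values):
    """Return (IQR, population standard deviation) for a list of numbers."""
    q75, q25 = np.percentile(values, [75, 25])
    return q75 - q25, np.std(values)

for name, values in [("clean", clean), ("skewed", skewed)]:
    iqr, std = iqr_and_std(values)
    print(f"{name}: IQR={iqr}, std={std:.2f}")
```

With these values the IQR stays the same for both lists, while the standard deviation grows substantially for the list containing the outlier.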

## Interquartile Range (IQR) Example 

In [2]:
import numpy as np

# Example list of numbers
numbers = [10, 15, 20, 25, 30]

# IQR
q75, q25 = np.percentile(numbers, [75, 25])
iqr = q75 - q25

print("Interquartile Range (IQR):", iqr)


Interquartile Range (IQR): 10.0


# 3. Quartile:

Quartiles are values that divide a dataset into four equal parts, each containing approximately 25% of the data. These quartiles are denoted as Q1, Q2, and Q3. Q2, also known as the median, divides the data into two equal halves. Q1 divides the lower 25% of the data from the upper 75%, while Q3 divides the upper 25% of the data from the lower 75%.

Quartiles are particularly useful in descriptive statistics and exploratory data analysis to understand the spread and distribution of numerical data. They provide insights into the central tendency and dispersion of the dataset, especially when combined with other measures such as the median and interquartile range.
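
Quartile values depend on the convention used to compute them. The sketch below, offered as one possible illustration, compares a "median of halves" approach with NumPy's default linear interpolation on a nine-value list. The function `quartiles_median_of_halves` is a hypothetical helper written for this example, not a library function.

```python
import statistics
import numpy as np

data = sorted([5, 7, 9, 12, 15, 18, 22, 25, 30])

def quartiles_median_of_halves(sorted_values):
    """Return (Q1, Q2, Q3), taking Q1 and Q3 as the medians of the lower
    and upper halves. The overall median is excluded from both halves
    when the number of values is odd."""
    n = len(sorted_values)
    half = n // 2
    lower = sorted_values[:half]
    upper = sorted_values[half + 1:] if n % 2 else sorted_values[half:]
    return (statistics.median(lower),
            statistics.median(sorted_values),
            statistics.median(upper))

print("Median of halves:", quartiles_median_of_halves(data))
print("NumPy (linear interpolation):", np.percentile(data, [25, 50, 75]))
```

Both approaches agree on the median but can differ on Q1 and Q3, so it helps to state which method was used when reporting quartiles.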

## Quartile Example:

In [3]:
import numpy as np

# Example list of numbers
numbers = [10, 15, 20, 25, 30]

# Quartiles
q1 = np.percentile(numbers, 25)
q2 = np.percentile(numbers, 50)
q3 = np.percentile(numbers, 75)

print("Quartiles:")
print("Q1:", q1)
print("Q2:", q2)
print("Q3:", q3)


Quartiles:
Q1: 15.0
Q2: 20.0
Q3: 25.0


# 4. Percentile:

Percentiles are values that divide a dataset into 100 equal parts, representing the relative standing of an observation within the dataset. Each percentile corresponds to a specific percentage of the data.

For example, the 50th percentile, also known as the median, represents the value below which 50% of the data falls. Similarly, the 25th percentile (Q1) represents the value below which 25% of the data falls, and the 75th percentile (Q3) represents the value below which 75% of the data falls.

Percentiles are commonly used in statistics to understand the distribution of numerical data and identify outliers or extreme values. They provide insights into the relative position of individual observations within a dataset, allowing for comparisons and analysis across different percentiles.
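
The sketch below shows both directions of this idea on the example list: looking up the value at a given percentile, and estimating the relative standing of a given value. The "share of values strictly below" definition used for the second part is one common convention, assumed here for illustration.

```python
import numpy as np

numbers = [10, 15, 20, 25, 30]

# Several percentiles at once (NumPy's default linear interpolation)
p10, p50, p90 = np.percentile(numbers, [10, 50, 90])
print("10th, 50th, 90th percentiles:", p10, p50, p90)

# Relative standing of one observation: share of values strictly below it
value = 25
percent_below = 100 * sum(x < value for x in numbers) / len(numbers)
print(f"{percent_below:.0f}% of the values fall below {value}")
```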

## Percentile Example:

In [4]:
import numpy as np

# Example list of numbers
numbers = [10, 15, 20, 25, 30]

# Percentile
percentile_90 = np.percentile(numbers, 90)

print("90th Percentile:", percentile_90)


90th Percentile: 28.0


# 5. Variance:

Variance measures how far the values in a dataset lie from their mean, on average. It is typically calculated in four steps:

1. **Calculate the Mean**: First, you compute the mean (average) of the dataset.

2. **Find the Difference from the Mean**: Next, you find the difference between each data point and the mean.

3. **Square the Differences**: You square each of these differences to ensure they are positive values and to emphasize deviations from the mean.

4. **Calculate the Average of the Squared Differences**: After squaring the differences, you calculate the average of these squared differences. Divide the sum by \( n \) for a population, or by \( n - 1 \) for a sample. This value represents the variance.

So, while squaring is involved, it's not the mean that's squared; rather, it's the deviations from the mean that are squared and averaged to obtain the variance.
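
The four steps can be sketched by hand as follows. The variable names are chosen for this example, and the NumPy calls at the end are only a cross-check of the manual result.

```python
import numpy as np

numbers = [10, 15, 20, 25, 30]

# Step 1: mean
mean = sum(numbers) / len(numbers)
# Step 2: difference of each value from the mean
deviations = [x - mean for x in numbers]
# Step 3: squared differences
squared = [d ** 2 for d in deviations]
# Step 4: average the squared differences
population_variance = sum(squared) / len(numbers)      # divide by n
sample_variance = sum(squared) / (len(numbers) - 1)    # divide by n - 1

print("Population variance:", population_variance)
print("Sample variance:", sample_variance)

# Cross-check: np.var uses ddof=0 by default; ddof=1 gives the sample form
print("NumPy:", np.var(numbers), np.var(numbers, ddof=1))
```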

Here's the formula for the sample variance:

\[ s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

For a population, divide by \( n \) instead of \( n - 1 \); this is what `np.var` computes by default.

Where:
- \( n \) is the number of observations in the sample.
- \( x_i \) represents each individual observation.
- \( \bar{x} \) is the mean of the sample.

## Variance Example:

In [5]:
import numpy as np

# Example list of numbers
numbers = [10, 15, 20, 25, 30]

# Variance (np.var defaults to the population variance, ddof=0;
# pass ddof=1 for the sample variance)
variance = np.var(numbers)

print("Variance:", variance)


Variance: 50.0


# 6. Standard Deviation:

The standard deviation is the square root of the variance. While variance describes how spread out the values in a dataset are in squared units, the standard deviation expresses this spread in the same units as the original data, which makes it more intuitive to interpret.

Here's how standard deviation is calculated:

1. **Calculate Variance**: First, you calculate the variance using the formula (sample form shown; divide by \( n \) instead of \( n - 1 \) for a population):

\[ s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

where \( n \) is the number of observations, \( x_i \) represents each individual observation, and \( \bar{x} \) is the mean.

2. **Take the Square Root**: Once you have the variance, you take the square root of it to obtain the standard deviation.

The standard deviation provides a measure of the dispersion or spread of the data points around the mean. A larger standard deviation indicates that the data points are more spread out from the mean, while a smaller standard deviation suggests that the data points are closer to the mean.
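
As a small illustration, the sketch below compares two lists with the same mean but different spread. The `tight` list is made up for this example; `wide` is the example list used throughout. Both the population form (NumPy's default, `ddof=0`) and the sample form (`ddof=1`) are printed.

```python
import numpy as np

tight = [18, 19, 20, 21, 22]   # made-up values clustered near the mean
wide = [10, 15, 20, 25, 30]    # the example list used throughout

for name, values in [("tight", tight), ("wide", wide)]:
    print(f"{name}: mean={np.mean(values)}, "
          f"population std={np.std(values):.3f}, "
          f"sample std={np.std(values, ddof=1):.3f}")
```

Both lists have a mean of 20, but the standard deviation of `wide` is several times larger, reflecting values that sit farther from the mean.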

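The formula for the sample standard deviation is:

\[ s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

Dividing by \( n \) instead of \( n - 1 \) gives the population standard deviation, which is what `np.std` computes by default.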

So, in summary, while variance gives you the average of the squared differences from the mean, the standard deviation gives you a measure of how much the values deviate from the mean in the original units of the data.

## Standard Deviation Example:

In [6]:
import numpy as np

# Example list of numbers
numbers = [10, 15, 20, 25, 30]

# Standard Deviation (np.std defaults to the population form, ddof=0)
std_deviation = np.std(numbers)

print("Standard Deviation:", std_deviation)


Standard Deviation: 7.0710678118654755


In [7]:
print("std_deviation:")
np.sqrt(variance)

std_deviation:


7.0710678118654755