Measures of dispersion are statistical tools used to describe the spread or variability of a dataset. They indicate how much the data points differ from the central value (such as the mean or median) and from each other. Common measures of dispersion include:

- **Range**: The difference between the maximum and minimum values in the dataset.
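- **Variance**: The average of the squared differences between each data point and the mean. (The sample variance, which pandas computes by default, divides by \( n - 1 \) rather than \( n \).)
- **Standard Deviation**: The square root of the variance. It is expressed in the same units as the data and describes the typical distance of data points from the mean.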
- **Interquartile Range (IQR)**: The difference between the third quartile (Q3) and the first quartile (Q1), showing the spread of the middle 50% of the data.

These measures show how consistent or variable the data are: small values mean the observations cluster tightly around the center, while large values mean they are widely spread.
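As a rough illustration of the four definitions above, the following sketch computes each measure by hand on a small made-up sample. The list `data` and all variable names are assumptions introduced for this example only; they are not part of the car dataset.

```python
import math
import statistics

# Made-up sample, chosen so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: maximum minus minimum.
value_range = max(data) - min(data)

# Population variance: average of the squared differences from the mean.
mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]
pop_variance = sum(squared_diffs) / len(data)

# Sample variance: same sum, divided by n - 1 instead of n.
sample_variance = sum(squared_diffs) / (len(data) - 1)

# Standard deviation: square root of the variance.
pop_std = math.sqrt(pop_variance)

# IQR: Q3 - Q1. The "inclusive" method interpolates between
# neighbouring sorted values when a quartile falls between two points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(f"Range: {value_range}")
print(f"Population variance: {pop_variance}, sample variance: {sample_variance:.3f}")
print(f"Population standard deviation: {pop_std}")
print(f"IQR: {iqr}")
```

Note that pandas' `.var()` and `.std()` use the sample version (`ddof=1`) unless told otherwise, so a call such as `df['price_usd'].std()` corresponds to the `n - 1` formula rather than the population one.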

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('data.csv')

The variable `df` is a pandas DataFrame containing a dataset with 38,531 rows and 30 columns. Here is a summary of its structure and content:

- **Rows**: 38,531 entries, indexed from 0 to 38,530.
- **Columns**: 30 columns with various data types:
    - **Object (string)**: 10 columns (e.g., `manufacturer_name`, `model_name`, `color`).
    - **Integer (int64)**: 5 columns (e.g., `odometer_value`, `year_produced`).
    - **Float (float64)**: 2 columns (e.g., `engine_capacity`, `price_usd`).
    - **Boolean (bool)**: 13 columns (e.g., `engine_has_gas`, `has_warranty`).
    
### Key Columns
- **manufacturer_name**: Name of the car manufacturer.
- **model_name**: Model of the car.
- **transmission**: Type of transmission (e.g., automatic, mechanical).
- **color**: Color of the car.
- **odometer_value**: Distance the car has traveled (in kilometers).
- **year_produced**: Year the car was manufactured.
- **engine_fuel**: Type of fuel used by the engine (e.g., gasoline, diesel).
- **engine_capacity**: Engine capacity in liters.
- **body_type**: Type of car body (e.g., sedan, SUV).
- **price_usd**: Price of the car in USD.
- **duration_listed**: Number of days the car has been listed for sale.

### Additional Details
- Some columns, like `engine_capacity`, have a few missing values.
- The dataset includes categorical, numerical, and boolean data.
- Memory usage: Approximately 5.5 MB.

This dataset appears to be related to car listings, with details about the cars' specifications, conditions, and sale information.


In [2]:
# Sample standard deviation of the listing prices (pandas uses ddof=1 by default)
df['price_usd'].std()

np.float64(6428.1520182029035)

### Range

The **range** is a measure of dispersion that represents the difference between the maximum and minimum values in a dataset. It provides a simple way to understand the spread of the data.

For the dataset in the variable `df`, the range can be calculated for numerical columns. For example:

- **Price (price_usd)**: The range of car prices can be calculated as:
    \[
    \text{Range} = \text{Maximum Price} - \text{Minimum Price}
    \]

- **Odometer Value (odometer_value)**: The range of odometer readings can be calculated similarly:
    \[
    \text{Range} = \text{Maximum Odometer Value} - \text{Minimum Odometer Value}
    \]

The range is useful for understanding the variability in the dataset but can be sensitive to outliers, as it only considers the extreme values.

In [3]:
# Range = max_value - min_value
rango = df['price_usd'].max() - df['price_usd'].min()
print(f"Range: {rango}")

Range: 49999.0
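
The range's sensitivity to extreme values can be seen on a small made-up sample. The sketch below is only illustrative: the two price lists and the helpers `value_range` and `iqr` are hypothetical and do not come from the dataset. Replacing a single value with an extreme one changes the range sharply, while the spread of the middle 50% stays the same.

```python
import pandas as pd

# Hypothetical prices in USD. The second sample differs only in its
# largest value, which is replaced by an extreme listing.
typical = pd.Series([4000, 5000, 6000, 7000, 8000])
with_outlier = pd.Series([4000, 5000, 6000, 7000, 50000])


def value_range(s):
    """Maximum minus minimum."""
    return s.max() - s.min()


def iqr(s):
    """Third quartile minus first quartile."""
    return s.quantile(0.75) - s.quantile(0.25)


print(f"Typical:      range={value_range(typical)}, IQR={iqr(typical)}")
print(f"With outlier: range={value_range(with_outlier)}, IQR={iqr(with_outlier)}")
```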


### Quartiles

Quartiles are the cut points that divide a sorted dataset into four parts, each containing about 25% of the data points. They are used to summarize the distribution of a dataset and provide insights into its spread and central tendency. The three main quartiles are:

1. **First Quartile (Q1)**: Also known as the lower quartile, it represents the 25th percentile of the data. This means that 25% of the data points are below Q1, and 75% are above it.

2. **Second Quartile (Q2)**: Also known as the median, it represents the 50th percentile of the data. This is the middle value of the dataset, where 50% of the data points are below Q2, and 50% are above it.

3. **Third Quartile (Q3)**: Also known as the upper quartile, it represents the 75th percentile of the data. This means that 75% of the data points are below Q3, and 25% are above it.
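
The three quartiles can be obtained in pandas with `Series.quantile`. The following is a minimal sketch on made-up numbers; the series `values` is an assumption for illustration, not a column of the dataset.

```python
import pandas as pd

# Made-up values, already sorted for readability.
values = pd.Series([1, 3, 5, 7, 9, 11, 13, 15, 17])

# quantile() accepts a list of fractions and returns one value per fraction.
q1, q2, q3 = values.quantile([0.25, 0.5, 0.75])

print(f"Q1: {q1}, Q2: {q2}, Q3: {q3}")
print(f"Q2 equals the median: {q2 == values.median()}")
```

When a quartile falls between two data points, `quantile` interpolates linearly between them by default, so the result need not be a value that appears in the data.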

### Interquartile Range (IQR)

The **Interquartile Range (IQR)** is a measure of dispersion that represents the range of the middle 50% of the data. It is calculated as:
\[
\text{IQR} = Q3 - Q1
\]

The IQR is useful for identifying outliers, as data points that fall below \( Q1 - 1.5 \times \text{IQR} \) or above \( Q3 + 1.5 \times \text{IQR} \) are often considered outliers.
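
The 1.5 × IQR rule above could be sketched as follows. The helper `iqr_bounds` and the odometer readings are hypothetical, introduced here only to show the arithmetic.

```python
import pandas as pd


def iqr_bounds(s, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical odometer readings in kilometers; the last one is extreme.
odometer = pd.Series([80000, 120000, 150000, 180000, 900000])

lower, upper = iqr_bounds(odometer)

# Keep only the points that fall outside the fences.
outliers = odometer[(odometer < lower) | (odometer > upper)]

print(f"Lower fence: {lower}, upper fence: {upper}")
print(f"Outliers: {outliers.tolist()}")
```

Here Q1 is 120,000 and Q3 is 180,000, so the IQR is 60,000 and the fences sit at 30,000 and 270,000. Only the 900,000 km reading falls outside them.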

### Example

For the dataset in the variable `df`, the quartiles can be calculated for numerical columns like `price_usd` or `odometer_value`. For instance:

- **Q1**: The value below which 25% of the car prices fall.
- **Q2 (Median)**: The middle value of the car prices.
- **Q3**: The value below which 75% of the car prices fall.

These quartiles provide a deeper understanding of the distribution of car prices or other numerical features in the dataset.
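
One way to read the quartiles off in a single call is `Series.describe()`, which reports them under the labels `25%`, `50%`, and `75%`. The sketch below uses a small stand-in DataFrame named `sample` with made-up prices, since the contents of `data.csv` are not reproduced here.

```python
import pandas as pd

# Stand-in for the notebook's DataFrame; the prices are made up.
sample = pd.DataFrame({"price_usd": [1500.0, 3200.0, 4800.0, 6500.0, 9900.0]})

# describe() returns count, mean, std, min, the three quartiles, and max.
summary = sample["price_usd"].describe()

q1, q2, q3 = summary["25%"], summary["50%"], summary["75%"]
iqr = q3 - q1

print(f"Q1: {q1}, Q2 (median): {q2}, Q3: {q3}")
print(f"IQR: {iqr}")
```

On the notebook's data, the equivalent call would be `df['price_usd'].describe()`.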