### 1. Types of Variables

#### Categorical Variables

**Nominal Variables**
- **Definition**: Nominal variables are categorical variables used to label or name attributes without any quantitative value. The categories have no logical order or ranking.
- **Examples**: 
  - **Gender**: Male, Female, Non-binary
  - **Marital Status**: Single, Married, Divorced, Widowed
  - **Country**: USA, Canada, Mexico, Brazil
  - **Favorite Color**: Red, Blue, Green, Yellow

**Ordinal Variables**
- **Definition**: Ordinal variables are categorical variables with a clear, meaningful order or ranking among the categories. However, the differences between the ranks are not measurable or consistent.
- **Examples**: 
  - **Movie Ratings**: Poor, Fair, Good, Very Good, Excellent
  - **Education Level**: High School, Bachelor’s, Master’s, PhD
  - **Socioeconomic Status**: Low, Middle, High
  - **Pain Severity**: None, Mild, Moderate, Severe
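
The distinction between nominal and ordinal data can be illustrated by mapping ordered categories to ranks. This is a minimal sketch: the `PAIN_LEVELS` list and the `responses` data are hypothetical, and the ranks encode order only, not distance between categories.

```python
# Hypothetical ordinal scale, listed from lowest to highest.
PAIN_LEVELS = ["None", "Mild", "Moderate", "Severe"]
RANK = {level: i for i, level in enumerate(PAIN_LEVELS)}

# Hypothetical survey responses.
responses = ["Mild", "Severe", "None", "Moderate", "Mild"]

# Order is meaningful for ordinal data, so sorting and the median are valid.
ordered = sorted(responses, key=RANK.get)
median_response = ordered[len(ordered) // 2]

print(ordered)
print(median_response)
```

Averaging the ranks would be misleading here, because the gap between Mild and Moderate is not known to equal the gap between Moderate and Severe.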

#### Numerical Variables

Numerical variables can be classified as discrete or continuous and, separately, by measurement scale as interval or ratio.

**Discrete Variables**
- **Definition**: Discrete variables represent countable quantities and take on a finite or countably infinite set of values. Each value is distinct and separate.
- **Examples**:
  - **Number of Children**: 0, 1, 2, 3
  - **Number of Cars in a Parking Lot**: 5, 10, 15
  - **Number of Students in a Class**: 20, 25, 30

**Continuous Variables**
- **Definition**: Continuous variables represent measurable quantities and can take on infinitely many values within a given range. The values can be any number, including fractions and decimals.
- **Examples**:
  - **Height**: 160.5 cm, 172.3 cm
  - **Weight**: 65.2 kg, 70.5 kg
  - **Temperature**: 36.5°C, 98.6°F

**Interval Scale**
- **Definition**: Interval variables are numerical variables where the differences between values are meaningful. However, there is no true zero point, so ratios are not meaningful.
- **Examples**:
  - **Temperature in Celsius or Fahrenheit**: The difference between 20°C and 30°C is the same as between 30°C and 40°C, but 0°C is not the absence of temperature.
  - **Calendar Years**: The difference between the years 2000 and 2010 is the same as between 1990 and 2000, but there is no absolute zero year.

**Ratio Scale**
- **Definition**: Ratio variables are numerical variables with a true zero point, meaning zero indicates the absence of the quantity. Both differences and ratios between values are meaningful.
- **Examples**:
  - **Height**: A height of 0 cm means no height. Ratios like "twice as tall" are meaningful.
  - **Weight**: A weight of 0 kg means no weight. Ratios like "half as heavy" are meaningful.
  - **Income**: An income of $0 means no income. Comparing incomes as ratios is meaningful.
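
The practical difference between interval and ratio scales can be checked with a little arithmetic. The sketch below compares two temperatures in Celsius (interval) and in Kelvin (ratio). Kelvin is not used elsewhere in this section and is introduced here only as an assumption for illustration, since it has a true zero.

```python
# Two temperatures on the Celsius (interval) scale.
c1, c2 = 10.0, 20.0

# The same temperatures on the Kelvin (ratio) scale.
k1, k2 = c1 + 273.15, c2 + 273.15

# Differences agree across the two scales...
diff_celsius = c2 - c1
diff_kelvin = k2 - k1

# ...but ratios do not: 20 degrees C is not "twice as hot" as 10 degrees C.
ratio_celsius = c2 / c1
ratio_kelvin = k2 / k1

print(f"Difference: {diff_celsius:.2f} (Celsius) vs {diff_kelvin:.2f} (Kelvin)")
print(f"Ratio: {ratio_celsius:.3f} (Celsius) vs {ratio_kelvin:.3f} (Kelvin)")
```

The Celsius ratio depends on where the scale places its arbitrary zero, which is why ratios are only interpreted on ratio scales.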

### Summary Table

| Type         | Definition                                                                 | Examples                                          |
|--------------|-----------------------------------------------------------------------------|---------------------------------------------------|
| Nominal      | Categories without a logical order or ranking                              | Gender, Marital Status, Country, Favorite Color   |
| Ordinal      | Categories with a meaningful order or ranking                              | Movie Ratings, Education Level, Socioeconomic Status, Pain Severity |
| Discrete     | Countable quantities with distinct and separate values                     | Number of Children, Number of Cars, Number of Students |
| Continuous   | Measurable quantities with infinitely many possible values                 | Height, Weight, Temperature                       |
| Interval     | Numerical variables with meaningful differences but no true zero point     | Temperature (Celsius/Fahrenheit), Calendar Years  |
| Ratio        | Numerical variables with meaningful differences and a true zero point      | Height, Weight, Income                            |

A variable's type determines which summary statistics and visualizations are appropriate for it, so identifying the type correctly is essential for accurate data analysis and interpretation in data science.

# Measures of central tendency

These measures describe the center or typical value of a dataset and are fundamental in statistics. We'll cover:

1. **Mean**
2. **Median**
3. **Mode**

For each measure, I'll provide definitions, formulations, and calculation examples.

### 1. Mean

#### Definition
The mean, often referred to as the average, is the sum of all data points divided by the number of data points. It provides a central value of the dataset.

#### Formulation
- **Population Mean ($\mu$)**: Used when considering the entire population.
  $$
  \mu = \frac{1}{N} \sum_{i=1}^{N} x_i
  $$
  Where $N$ is the total number of observations and $x_i$ are the individual data points.

- **Sample Mean ($\bar{x}$)**: Used when considering a sample from the population.
  $$
  \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
  $$
  Where $n$ is the total number of observations in the sample.

#### Calculation Examples
- **Example 1 (Population Mean)**:
  Dataset: $2, 4, 6, 8, 10$
  $$
  \mu = \frac{2 + 4 + 6 + 8 + 10}{5} = \frac{30}{5} = 6
  $$

- **Example 2 (Sample Mean)**:
  Sample Dataset: $3, 7, 5$
  $$
  \bar{x} = \frac{3 + 7 + 5}{3} = \frac{15}{3} = 5
  $$

### 2. Median

#### Definition
The median is the middle value of a dataset when it is ordered in ascending or descending order. If the dataset has an even number of observations, the median is the average of the two middle values.

#### Formulation
- For an **odd** number of observations:
  $$
  \text{Median} = x_{\left(\frac{n+1}{2}\right)}
  $$

- For an **even** number of observations:
  $$
  \text{Median} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2} + 1\right)}}{2}
  $$

#### Calculation Examples
- **Example 1 (Odd Number of Observations)**:
  Dataset: $3, 1, 4, 2, 5$ (sorted: $1, 2, 3, 4, 5$)
  $$
  \text{Median} = 3 \quad (\text{3rd observation})
  $$

- **Example 2 (Even Number of Observations)**:
  Dataset: $7, 1, 3, 5$ (sorted: $1, 3, 5, 7$)
  $$
  \text{Median} = \frac{3 + 5}{2} = 4
  $$

### 3. Mode

#### Definition
The mode is the value that appears most frequently in a dataset. A dataset may have one mode, more than one mode, or no mode at all.

#### Calculation Examples
- **Example 1 (Single Mode)**:
  Dataset: $2, 3, 4, 4, 5$
  $$
  \text{Mode} = 4
  $$

- **Example 2 (Multiple Modes)**:
  Dataset: $1, 1, 2, 3, 3$
  $$
  \text{Mode} = 1 \text{ and } 3
  $$

- **Example 3 (No Mode)**:
  Dataset: $1, 2, 3, 4$
  $$
  \text{Mode} = \text{None}
  $$

### Summary

| Measure | Definition | Formula | Calculation Example |
|---------|------------|---------|---------------------|
| **Mean** | Average of all data points | $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$ (Population) $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (Sample) | Dataset: $2, 4, 6, 8, 10$; Mean: $6$ |
| **Median** | Middle value when data is ordered | Odd: $x_{\left(\frac{n+1}{2}\right)}$ Even: $\frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2} + 1\right)}}{2}$ | Dataset: $3, 1, 4, 2, 5$; Median: $3$ |
| **Mode** | Most frequently occurring value | N/A | Dataset: $2, 3, 4, 4, 5$; Mode: $4$ |

These measures provide different insights into the central tendency of a dataset. The mean is useful for datasets without extreme values, the median is more robust to outliers, and the mode is helpful for understanding the most common value(s). Each measure has its applications depending on the nature of the data and the analysis goals.
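
The claim that the median is more robust to outliers than the mean can be checked directly. The sketch below uses a small hypothetical dataset and appends a single extreme value; the numbers are chosen only for illustration.

```python
import numpy as np

# Hypothetical dataset, then the same data with one extreme value appended.
data = [2, 4, 6, 8, 10]
data_with_outlier = data + [100]

mean_before, median_before = np.mean(data), np.median(data)
mean_after, median_after = np.mean(data_with_outlier), np.median(data_with_outlier)

print(f"Without outlier: mean={mean_before:.2f}, median={median_before:.2f}")
print(f"With outlier:    mean={mean_after:.2f}, median={median_after:.2f}")
```

The single extreme value pulls the mean far from the bulk of the data, while the median moves only slightly.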


In [None]:
import numpy as np
from scipy import stats

# Example data sets
dataset_population = [2, 4, 6, 8, 10]
dataset_sample = [3, 7, 5]
dataset_odd = [3, 1, 4, 2, 5]
dataset_even = [7, 1, 3, 5]
dataset_mode_single = [2, 3, 4, 4, 5]
dataset_mode_multiple = [1, 1, 2, 3, 3]
dataset_no_mode = [1, 2, 3, 4]

In [None]:
# Mean calculation
mean_population = np.mean(dataset_population)
mean_sample = np.mean(dataset_sample)
print(f"Population Mean: {mean_population}")
print(f"Sample Mean: {mean_sample}")

In [None]:
# Median calculation
median_odd = np.median(dataset_odd)
median_even = np.median(dataset_even)
print(f"Median (Odd dataset): {median_odd}")
print(f"Median (Even dataset): {median_even}")

In [None]:
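# Mode calculation
# Note: stats.mode reports a single value (the smallest of the most frequent
# values), so ties and "no mode" datasets are not flagged as such.
mode_single = stats.mode(dataset_mode_single)
mode_multiple = stats.mode(dataset_mode_multiple)  # reports 1, although 3 is also a mode
mode_no_mode = stats.mode(dataset_no_mode)  # reports 1 with count 1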

print(f"Mode (Single mode dataset): {mode_single.mode} with count {mode_single.count}")
print(f"Mode (Multiple modes dataset): {mode_multiple.mode} with counts {mode_multiple.count}")

In [None]:
print(f"Mode (No mode dataset): {mode_no_mode.mode} with count {mode_no_mode.count}")
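
Because `scipy.stats.mode` reports only one value, it does not reproduce the "multiple modes" and "no mode" cases from the worked examples. The sketch below is one possible way to recover them with `collections.Counter`. The helper name `all_modes` is hypothetical, and treating a dataset in which every value occurs equally often as having no mode is an assumed convention, not a universal one.

```python
from collections import Counter

def all_modes(values):
    """Return every value tied for the highest count, or [] if there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Assumed convention: if all distinct values occur equally often, report no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(all_modes([2, 3, 4, 4, 5]))
print(all_modes([1, 1, 2, 3, 3]))
print(all_modes([1, 2, 3, 4]))
```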

# Measures of spread (or dispersion)

These measures include:

1. **Range**
2. **Variance**
3. **Standard Deviation**
4. **Interquartile Range (IQR)**

### 1. Range

#### Definition
The range is the difference between the maximum and minimum values in a dataset. It provides a simple measure of the spread but is highly sensitive to outliers.

#### Formulation
$$
\text{Range} = \max(x) - \min(x)
$$

#### Calculation Example
- Dataset: $3, 5, 7, 2, 9$
  $$
  \text{Range} = 9 - 2 = 7
  $$

### 2. Variance

#### Definition
Variance measures the average squared deviation of each data point from the mean. It quantifies the spread of the data points.

#### Formulation
- **Population Variance ($\sigma^2$)**:
  $$
  \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
  $$
  Where $N$ is the total number of observations, $x_i$ are the individual data points, and $\mu$ is the population mean.

- **Sample Variance ($s^2$)**:
  $$
  s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
  $$
  Where $n$ is the total number of observations in the sample, $x_i$ are the individual data points, and $\bar{x}$ is the sample mean.

#### Calculation Example
- Dataset: $4, 8, 6, 5, 3$
  $$
  \bar{x} = \frac{4 + 8 + 6 + 5 + 3}{5} = \frac{26}{5} = 5.2
  $$
  $$
  s^2 = \frac{(4-5.2)^2 + (8-5.2)^2 + (6-5.2)^2 + (5-5.2)^2 + (3-5.2)^2}{5-1}
  $$
  $$
  s^2 = \frac{1.44 + 7.84 + 0.64 + 0.04 + 4.84}{4} = \frac{14.8}{4} = 3.7
  $$
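
The worked example can be reproduced step by step to show how the two formulas differ only in the divisor. This is a sketch using the same dataset; it compares the manual calculation with `np.var`, whose `ddof` argument selects the divisor ($N$ for `ddof=0`, $n-1$ for `ddof=1`).

```python
import numpy as np

data = np.array([4, 8, 6, 5, 3])

# Squared deviations from the mean, shared by both formulas.
squared_deviations = (data - data.mean()) ** 2

population_variance = squared_deviations.sum() / len(data)    # divide by N
sample_variance = squared_deviations.sum() / (len(data) - 1)  # divide by n - 1

print(f"Population variance: {population_variance:.2f} (np.var: {np.var(data):.2f})")
print(f"Sample variance:     {sample_variance:.2f} (np.var, ddof=1: {np.var(data, ddof=1):.2f})")
```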

### 3. Standard Deviation

#### Definition
The standard deviation is the square root of the variance. It provides a measure of spread in the same units as the data.

#### Formulation
$$
\sigma = \sqrt{\sigma^2} \quad \text{(population)}, \qquad s = \sqrt{s^2} \quad \text{(sample)}
$$

#### Calculation Example
- Dataset: $4, 8, 6, 5, 3$
  $$
  s = \sqrt{3.7} \approx 1.92
  $$

### 4. Interquartile Range (IQR)

#### Definition
The IQR is the range of the middle 50% of the data. It is the difference between the third quartile (Q3) and the first quartile (Q1).

#### Formulation
$$
\text{IQR} = Q3 - Q1
$$

#### Calculation Example
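- Dataset: $1, 3, 5, 7, 9, 11, 13$, with each quartile taken as the median of the lower or upper half of the data (the overall median, 7, is excluded from both halves). Other quartile conventions can give slightly different values.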
  $$
  Q1 = 3 \quad (\text{25th percentile})
  $$
  $$
  Q3 = 11 \quad (\text{75th percentile})
  $$
  $$
  \text{IQR} = 11 - 3 = 8
  $$
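
Quartiles have several competing definitions, so different tools can report different IQRs for the same data. The sketch below contrasts the median-of-halves method used in the worked example with the default behavior of `np.percentile`, which interpolates linearly between order statistics. The variable names are illustrative.

```python
import numpy as np

data = sorted([1, 3, 5, 7, 9, 11, 13])
n = len(data)

# Median-of-halves method: split around the median (excluded when n is odd).
lower_half = data[: n // 2]
upper_half = data[(n + 1) // 2 :]
q1_halves = np.median(lower_half)
q3_halves = np.median(upper_half)

# NumPy's default: linear interpolation between order statistics.
q1_interp, q3_interp = np.percentile(data, [25, 75])

print(f"Median of halves: Q1={q1_halves}, Q3={q3_halves}, IQR={q3_halves - q1_halves}")
print(f"np.percentile:    Q1={q1_interp}, Q3={q3_interp}, IQR={q3_interp - q1_interp}")
```

Neither result is wrong; they follow different conventions, so it helps to state which one is in use when reporting an IQR.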

In [None]:
import numpy as np

# Example datasets
dataset_range = [3, 5, 7, 2, 9]
dataset_variance = [4, 8, 6, 5, 3]
dataset_iqr = [1, 3, 5, 7, 9, 11, 13]

In [None]:
# Range calculation
range_value = np.max(dataset_range) - np.min(dataset_range)
print(f"Range: {range_value}")

In [None]:
# Variance calculation (Sample)
variance_sample = np.var(dataset_variance, ddof=1)
print(f"Sample Variance: {variance_sample}")

In [None]:
# Standard deviation calculation (Sample)
std_deviation_sample = np.std(dataset_variance, ddof=1)
print(f"Sample Standard Deviation: {std_deviation_sample}")

In [None]:
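# IQR calculation
# np.percentile interpolates linearly by default, so it gives Q1 = 4.0 and
# Q3 = 10.0 (IQR = 6.0) here, not the hand-calculated values 3 and 11.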
Q1 = np.percentile(dataset_iqr, 25)
Q3 = np.percentile(dataset_iqr, 75)
IQR = Q3 - Q1
print(f"IQR: {IQR}")

# Measures of shape

Measures of shape describe a distribution's symmetry and the heaviness of its tails, often summarized as the peakedness or flatness of the data. These measures include:

1. **Skewness**
2. **Kurtosis**

### 1. Skewness

#### Definition
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable. It indicates whether the data points are skewed to the left (negative skew) or to the right (positive skew).

- **Zero Skewness**: A symmetrical distribution has a skewness of 0.
- **Positive Skewness**: The right tail is longer or fatter. The mass of the distribution is concentrated on the left.
- **Negative Skewness**: The left tail is longer or fatter. The mass of the distribution is concentrated on the right.

#### Formulation
The bias-adjusted sample skewness ($S$) can be calculated using the following formula:
$$
S = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3
$$
Where:
- $n$ is the number of observations.
- $x_i$ are the individual data points.
- $\bar{x}$ is the sample mean.
- $s$ is the sample standard deviation.
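
The formula above can be implemented directly and compared with SciPy. This is a sketch: the helper name `adjusted_skewness` and the dataset are hypothetical. Note that `scipy.stats.skew` uses the biased estimator by default, so `bias=False` is passed to match the bias-adjusted formula.

```python
import numpy as np
from scipy.stats import skew

def adjusted_skewness(values):
    """Bias-adjusted sample skewness, following the formula above."""
    x = np.asarray(values, dtype=float)
    n = x.size
    s = x.std(ddof=1)            # sample standard deviation
    z = (x - x.mean()) / s       # standardized deviations
    return n / ((n - 1) * (n - 2)) * np.sum(z ** 3)

data = [2, 3, 5, 8, 13, 21]  # hypothetical data with a longer right tail

manual = adjusted_skewness(data)
from_scipy = skew(data, bias=False)

print(f"Manual formula:         {manual:.4f}")
print(f"scipy skew(bias=False): {from_scipy:.4f}")
```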



In [None]:
from scipy.stats import skew, kurtosis

# Example dataset
dataset_shape = [3, 5, 7, 9, 11, 13, 15]

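# Skewness calculation
# skew() defaults to the biased (population) estimator; pass bias=False for
# the bias-adjusted sample formula. Both are 0 for this symmetric dataset.
skewness = skew(dataset_shape)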
print(f"Skewness: {skewness}")

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, skew, skewnorm

# Generate data for a normal distribution
np.random.seed(0)
normal_data = norm.rvs(size=1000)

# Generate data for a right-skewed distribution
right_skewed_data = skewnorm.rvs(a=5, size=1000)

# Generate data for a left-skewed distribution
left_skewed_data = skewnorm.rvs(a=-5, size=1000)

# Plot the distributions
fig, ax = plt.subplots(3, 1, figsize=(6, 8))

# Normal distribution
# Compute the skewness inline: normal_skewness is only defined later in this cell.
ax[0].hist(normal_data, bins=30, density=True, alpha=0.6, color='g', label=f"Skewness: {skew(normal_data):.3f}")
xmin, xmax = ax[0].get_xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, 0, 1)
ax[0].plot(x, p, 'k', linewidth=2)
ax[0].set_title('Normal Distribution')
ax[0].legend()


# Right-skewed distribution
ax[1].hist(right_skewed_data, bins=30, density=True, alpha=0.6, color='b')
xmin, xmax = ax[1].get_xlim()
x = np.linspace(xmin, xmax, 100)
p = skewnorm.pdf(x, 5, loc=0, scale=1)
ax[1].plot(x, p, 'k', linewidth=2)
ax[1].set_title('Right-Skewed Distribution')

# Left-skewed distribution
ax[2].hist(left_skewed_data, bins=30, density=True, alpha=0.6, color='r')
xmin, xmax = ax[2].get_xlim()
x = np.linspace(xmin, xmax, 100)
p = skewnorm.pdf(x, -5, loc=0, scale=1)
ax[2].plot(x, p, 'k', linewidth=2)
ax[2].set_title('Left-Skewed Distribution')

plt.tight_layout()
plt.show()

# Calculate skewness for each distribution
from scipy.stats import skew

normal_skewness = skew(normal_data)
right_skewed_skewness = skew(right_skewed_data)
left_skewed_skewness = skew(left_skewed_data)

print(f"Normal Distribution Skewness: {normal_skewness}")
print("Mean:", normal_data.mean().round(3), "Std Dev:", np.std(normal_data).round(3), "Median:", np.median(normal_data).round(3))

print(f"\nRight-Skewed Distribution Skewness: {right_skewed_skewness}")
print("Mean:", right_skewed_data.mean().round(3), "Std Dev:", np.std(right_skewed_data).round(3), "Median:", np.median(right_skewed_data).round(3))

print(f"\nLeft-Skewed Distribution Skewness: {left_skewed_skewness}")
print("Mean:", left_skewed_data.mean().round(3), "Std Dev:", np.std(left_skewed_data).round(3), "Median:", np.median(left_skewed_data).round(3))

## Examples from the real world

### Normal Distribution
**Example**: Heights of Adult Males
- In a large population, the heights of adult males tend to follow a normal distribution. Most people have heights around the mean, with fewer individuals being extremely short or tall.

**Example**: Test Scores
- Standardized test scores (e.g., SAT, IQ tests) often follow a normal distribution, where most students score near the average, with fewer students achieving extremely high or low scores.

### Right-Skewed Distribution
**Example**: Income Distribution
- Income distribution within a population is often right-skewed, where a large number of people earn lower incomes, and a smaller number of people earn significantly higher incomes.

**Example**: Age at Retirement
- The age at which people retire tends to be right-skewed. Most people retire around a typical retirement age, but some retire much later, creating a longer tail to the right.

### Left-Skewed Distribution
**Example**: Age at Death in Certain Populations
- In some populations, especially those with good healthcare, the age at death might be left-skewed. Most individuals live to an old age, but a smaller number die young due to accidents or illness.

**Example**: Scores on an Easy Exam
- If an exam is easy for most students, the scores may be left-skewed. Most scores will be clustered near the maximum, with a few much lower scores forming a longer tail to the left.

### 2. Kurtosis

#### Definition
Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. It indicates whether the data points are more or less outlier-prone (heavier or lighter tails) than a normal distribution. The thresholds below use Pearson kurtosis, for which a normal distribution scores 3. Excess (Fisher) kurtosis subtracts 3, so the same thresholds sit at 0.

- **Leptokurtic**: Kurtosis > 3. Distribution with heavier tails and a sharper peak.
- **Mesokurtic**: Kurtosis = 3. Distribution similar to a normal distribution.
- **Platykurtic**: Kurtosis < 3. Distribution with lighter tails and a flatter peak.

#### Formulation
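The sample excess kurtosis ($K$), which is centered so that a normal distribution scores approximately 0 rather than 3, can be calculated using the following formula: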
$$
K = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}
$$
Where:
- $n$ is the number of observations.
- $x_i$ are the individual data points.
- $\bar{x}$ is the sample mean.
- $s$ is the sample standard deviation.
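
As a check on the formula above, the sketch below implements it directly and compares the result with `scipy.stats.kurtosis`. The helper name `sample_excess_kurtosis` is hypothetical. The arguments `fisher=True` (excess kurtosis) and `bias=False` (bias-corrected estimator) are passed so that SciPy computes the same quantity.

```python
import numpy as np
from scipy.stats import kurtosis

def sample_excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis, following the formula above."""
    x = np.asarray(values, dtype=float)
    n = x.size
    s = x.std(ddof=1)            # sample standard deviation
    z = (x - x.mean()) / s       # standardized deviations
    scaled_sum = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * np.sum(z ** 4)
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scaled_sum - correction

data = [3, 5, 7, 9, 11, 13, 15]

manual = sample_excess_kurtosis(data)
from_scipy = kurtosis(data, fisher=True, bias=False)

print(f"Manual formula:                          {manual:.4f}")
print(f"scipy kurtosis(fisher=True, bias=False): {from_scipy:.4f}")
```

The result is negative for this evenly spaced dataset, which is consistent with a flat, light-tailed (platykurtic) shape.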


In [None]:
from scipy.stats import skew, kurtosis

# Example dataset
dataset_shape = [3, 5, 7, 9, 11, 13, 15]

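# Kurtosis calculation
# kurtosis() returns excess kurtosis by default (fisher=True), so a normal
# distribution scores about 0 rather than 3.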
kurt = kurtosis(dataset_shape)
print(f"Kurtosis: {kurt}")

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, laplace

# Generate data for mesokurtic distribution (Normal distribution)
np.random.seed(0)
mesokurtic_data = norm.rvs(size=1000)

# Generate data for leptokurtic distribution (Laplace distribution)
leptokurtic_data = laplace.rvs(size=1000)

# Generate data for platykurtic distribution (Uniform distribution)
platykurtic_data = np.random.uniform(low=-1, high=1, size=1000)

# Plot the distributions
fig, ax = plt.subplots(3, 1, figsize=(6,8))

# Mesokurtic distribution
ax[0].hist(mesokurtic_data, bins=30, density=True, alpha=0.6, color='g')
xmin, xmax = ax[0].get_xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, 0, 1)
ax[0].plot(x, p, 'k', linewidth=2)
ax[0].set_title('Mesokurtic Distribution (Normal)')

# Leptokurtic distribution
ax[1].hist(leptokurtic_data, bins=30, density=True, alpha=0.6, color='b')
xmin, xmax = ax[1].get_xlim()
x = np.linspace(xmin, xmax, 100)
p = laplace.pdf(x, 0, 1)
ax[1].plot(x, p, 'k', linewidth=2)
ax[1].set_title('Leptokurtic Distribution (Laplace)')

# Platykurtic distribution
ax[2].hist(platykurtic_data, bins=30, density=True, alpha=0.6, color='r')
xmin, xmax = ax[2].get_xlim()
x = np.linspace(xmin, xmax, 100)
p = [1/2]*100
ax[2].plot(x, p, 'k', linewidth=2)
ax[2].set_title('Platykurtic Distribution (Uniform)')

plt.tight_layout()
plt.show()

# Calculate kurtosis for each distribution
from scipy.stats import kurtosis

mesokurtic_kurtosis = kurtosis(mesokurtic_data)
leptokurtic_kurtosis = kurtosis(leptokurtic_data)
platykurtic_kurtosis = kurtosis(platykurtic_data)

print(f"Mesokurtic Distribution Kurtosis: {mesokurtic_kurtosis}")
print(f"Leptokurtic Distribution Kurtosis: {leptokurtic_kurtosis}")
print(f"Platykurtic Distribution Kurtosis: {platykurtic_kurtosis}")


## Examples from the real world
### Leptokurtic Distribution
**Example**: Stock Market Returns During Crises
- During financial crises, stock market returns can become leptokurtic, showing sharper peaks and heavier tails than normal distributions, indicating frequent extreme gains or losses.

**Example**: Daily Rainfall Amounts in Certain Climates
- In some climates, daily rainfall amounts can be leptokurtic, with many days of little to no rain and a few days with very heavy rain.

### Mesokurtic Distribution
**Example**: Standardized Test Scores
- As mentioned earlier, standardized test scores like IQ tests are designed to be mesokurtic, resembling the normal distribution with moderate tails and peak.

**Example**: Measurement Errors in Physical Experiments
- Measurement errors in well-controlled physical experiments often follow a mesokurtic distribution, assuming they are normally distributed.

### Platykurtic Distribution
**Example**: Uniformly Distributed Data
- Data that is uniformly distributed (e.g., rolling a fair die) is platykurtic, as it has lighter tails and a flatter peak than a normal distribution.

**Example**: Heights of a Uniformly Selected Population
- If a population is artificially selected to be uniformly distributed in height (e.g., selecting individuals within a certain height range evenly), the height distribution would be platykurtic.


