## Descriptive Statistics of the Data

Descriptive statistics summarize a data set in a meaningful and understandable way, using a few characteristic values (such as the mean or the variance) and tables (such as a frequency distribution).

### Scales of Measurement

- Nominal (Categorical)
    - Example: [dog, cat, bird]
- Ordinal (Categorical with order)
    - Example: [small, medium, large]
- Interval (Numerical, equal intervals)
    - Example: [ $5^{\circ}C$, $10^{\circ}C$, $15^{\circ}C$, $20^{\circ}C$ ]
- Ratio (Numerical, equal intervals, absolute zero)
    - Example: [ $0\,K$, $5\,K$, $10\,K$, $15\,K$ ] (kelvin is written without a degree sign)

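$\green{\text{data on any scale can be converted to the scales listed above it (for example, ratio to ordinal),}}\red{\text{ but not vice versa}}$

Converting toward the top of the list discards information (the zero point, the distances, or the order), which is why the conversion cannot be undone.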
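
As an illustration of converting between scales, the sketch below reduces ratio-scale values to ordinal labels and then to nominal categories. The height values, the thresholds, and the helper name `to_ordinal` are assumptions made up for this example.

```python
# Hypothetical ratio-scale measurements: body heights in centimetres
heights_cm = [152.0, 171.5, 168.0, 190.2, 159.9]

def to_ordinal(height_cm):
    """Map a ratio-scale height to an ordinal label (thresholds are arbitrary)."""
    if height_cm < 160:
        return "small"
    if height_cm < 180:
        return "medium"
    return "large"

# Ratio -> ordinal: the order survives, the exact distances are lost
ordinal = [to_ordinal(h) for h in heights_cm]
print(ordinal)  # ['small', 'medium', 'medium', 'large', 'small']

# Ordinal -> nominal: only category membership is kept, the order is ignored
nominal = set(ordinal)
print(sorted(nominal))  # ['large', 'medium', 'small']
```

The reverse direction is not possible: the label `"medium"` alone does not tell whether the original value was 168.0 or 171.5.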


### Population vs Sample

- Population: the entire set of items from which data can be selected
    - Example: Humans with cancer
- Sample: a subset of the population
    - Example: 100 humans with cancer
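
A minimal sketch of drawing a sample from a population, assuming the population can be enumerated. The ID numbers, the population size of 10,000, and the seed are placeholders, not real data.

```python
import random

# Hypothetical population: ID numbers standing in for 10,000 patients
population = list(range(10_000))

rng = random.Random(42)  # fixed seed so the sketch is reproducible
sample = rng.sample(population, k=100)  # 100 distinct items, drawn without replacement

print(len(sample))                      # 100
print(set(sample) <= set(population))   # True: every sampled item is in the population
```

Drawing without replacement means no item appears twice in the sample, which matches the idea of a sample as a subset of the population.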

### Parameters

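-   **Mean** is the average of the data: the sum of all data points divided by the number of data points. It describes the center of the data.
    - $\mu = \frac{\sum_{i=1}^{n} x_i}{n}$
-   **Mode** (modus) is the most frequent data point in the data.
    - $\text{mode} = \arg\max_{x} f(x)$, where $f(x)$ is the number of times the value $x$ appears in the data
-   **Median** is the middle data point of the sorted data. If the number of data points is even, it is the average of the two middle data points.
    - $\text{median} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & \text{if } n \text{ is even} \end{cases}$, where $x_{(1)} \le \dots \le x_{(n)}$ are the sorted data points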

In [14]:
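def mean(array):
    # Sum of all values divided by the number of values
    return sum(array) / len(array)

def median(array):
    # Sort a copy so the caller's list is not modified
    ordered = sorted(array)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        # Even length: average the two middle values
        return (ordered[mid - 1] + ordered[mid]) / 2
    else:
        # Odd length: take the single middle value
        return ordered[mid]

def mode(array):
    # Count how often each value occurs, visiting values in ascending order
    freq = {}
    for i in sorted(array):
        if i in freq:
            freq[i] += 1
        else:
            freq[i] = 1
    # On a tie, the smallest of the most frequent values is returned
    return max(freq, key = freq.get)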
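
As a quick cross-check of the three definitions, the sketch below uses Python's standard `statistics` module on the ten-value data set that also appears in the frequency examples. It is a usage illustration, not a replacement for the functions above.

```python
import statistics

data = [1, 1, 2, 4, 24, 58, 2, 24, 1, 78]

# Sorted data: [1, 1, 1, 2, 2, 4, 24, 24, 58, 78]
print(statistics.mean(data))    # 19.5  (195 / 10)
print(statistics.median(data))  # 3.0   (average of the two middle values 2 and 4)
print(statistics.mode(data))    # 1     (appears three times)
```

The mean (19.5) is pulled far above the median (3.0) by the few large values, which shows why the two measures can describe the "center" of the same data very differently.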



### Frequency distribution

-   **Frequency** is the number of times a data point appears in the data.
    - **Frequency distribution** is a table that shows the frequency of each data point in the data. It is useful for:
    1. Understanding the general structure of the data
    2. Determining the most frequent data points
    3. Comparing the frequencies of different data points or groups in the data
    - Example: **[1, 1, 2, 4, 24, 58, 2, 24, 1, 78]** has a frequency distribution of **[1: 3, 2: 2, 4: 1, 24: 2, 58: 1, 78: 1]**
        | Data | Frequency |
        | ---- | --------- |
        | 1    | 3         |
        | 2    | 2         |
        | 4    | 1         |
        | 24   | 2         |
        | 58   | 1         |
        | 78   | 1         |
-   **Relative frequency** is the frequency of a data point divided by the total number of data points.
    - **Relative frequency distribution** is a table that shows the relative frequency of each data point in the data. It is useful for:
    1. Comparing the distribution of data points between different data sets
    2. Normalizing the data so that the result does not depend on the size of the data set
    3. Interpreting the significance of certain observations in the context of the data set
    - Example: **[1, 1, 2, 4, 24, 58, 2, 24, 1, 78]** has a relative frequency distribution of **[1: 0.3, 2: 0.2, 4: 0.1, 24: 0.2, 58: 0.1, 78: 0.1]**
        | Data | Relative Frequency |
        | ---- | ------------------ |
        | 1    | 0.3                |
        | 2    | 0.2                |
        | 4    | 0.1                |
        | 24   | 0.2                |
        | 58   | 0.1                |
        | 78   | 0.1                |

In [9]:
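def absolute_frequency(array):
    # Map each distinct value to the number of times it occurs
    freq = {}
    for i in array:
        if i in freq:
            freq[i] += 1
        else:
            freq[i] = 1
    return freq

def relative_frequency(array):
    # Divide each absolute frequency by the total number of data points
    freq = absolute_frequency(array)
    for i in freq:
        freq[i] /= len(array)
    return freq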
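
The same two tables can also be built with `collections.Counter` from the standard library. The sketch below reproduces the example data from the tables above; the variable names `absolute` and `relative` are chosen only for this illustration.

```python
from collections import Counter

data = [1, 1, 2, 4, 24, 58, 2, 24, 1, 78]

# Absolute frequencies: value -> number of occurrences
absolute = Counter(data)

# Relative frequencies: value -> share of all data points
relative = {value: count / len(data) for value, count in absolute.items()}

print(dict(absolute))            # {1: 3, 2: 2, 4: 1, 24: 2, 58: 1, 78: 1}
print(relative)                  # {1: 0.3, 2: 0.2, 4: 0.1, 24: 0.2, 58: 0.1, 78: 0.1}
print(absolute.most_common(1))   # [(1, 3)]
```

The absolute frequencies add up to the number of data points (10), and the relative frequencies add up to 1 apart from floating-point rounding.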

### Scattering parameters

Scattering (dispersion) parameters describe how widely the data is spread.

-  **Range** is the difference between the largest and smallest data point in the data.
    - $range = max(data) - min(data)$
-  **MAD** (mean absolute deviation) is the mean of the absolute differences between each data point and the mean of the data.
    - $MAD = \frac{\sum_{i=1}^{n} |x_i - \mu|}{n}$
-  **Sum of Squared Deviations** (SSD) is the sum of the squared differences between each data point and the mean of the data.
    - $SSD = \sum_{i=1}^{n} (x_i - \mu)^2$

    $\fbox{\text{MAD is more robust than SSD because outliers affect it less (deviations are not squared)}}$
    $\fbox{\text{SSD is better where sensitivity to large deviations in the data is required}}$

-  **Variance** is the average of the squared differences between each data point and the mean of the data.
    - $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$ (population variance)
    - $S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$ (sample variance)
    - $\mu$ is the population mean and $\bar{x}$ is the sample mean
    - $n-1$ is the number of degrees of freedom. Dividing by $n-1$ instead of $n$ compensates for the bias of the sample variance: the deviations are measured from the sample mean, which lies closer to the sample than the population mean does, so dividing by $n$ would underestimate the spread on average.

In [None]:
def data_range(array):
    # Named data_range so that the built-in range() is not shadowed
    return max(array) - min(array)

def MAD(array):
    # Mean absolute deviation from the mean
    n = len(array)
    mean = sum(array) / n
    return sum(abs(x - mean) for x in array) / n

def SSD(array):
    # Sum of squared deviations from the mean
    n = len(array)
    mean = sum(array) / n
    return sum((x - mean) ** 2 for x in array)

def variance(array, mode = 'population'):
    n = len(array)
    mean = sum(array) / n
    ssd = sum((x - mean) ** 2 for x in array)
    if mode == 'population':
        return ssd / n
    elif mode == 'sample':
        # Divide by n - 1 (requires at least two data points)
        return ssd / (n - 1)
    else:
        raise ValueError("mode must be 'population' or 'sample'")
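
To make the dispersion measures concrete, the sketch below computes them for a small made-up data set and then adds one outlier to compare how strongly MAD and SSD react. The data values and the helper names `mad` and `ssd` are assumptions for this illustration; the variances come from the standard `statistics` module.

```python
import statistics

def mad(values):
    """Mean absolute deviation from the mean (illustrative helper)."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)

def ssd(values):
    """Sum of squared deviations from the mean (illustrative helper)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values, mean = 5

print(max(data) - min(data))        # range: 9 - 2 = 7
print(mad(data))                    # 12 / 8 = 1.5
print(ssd(data))                    # 32.0
print(statistics.pvariance(data))   # population variance: 32 / 8 = 4
print(statistics.variance(data))    # sample variance: 32 / 7, about 4.571

# Add a single outlier and compare the growth factor of each measure
with_outlier = data + [100]
mad_growth = mad(with_outlier) / mad(data)
ssd_growth = ssd(with_outlier) / ssd(data)
print(mad_growth, ssd_growth)
```

With the outlier added, SSD grows by a much larger factor than MAD, because squaring amplifies the one large deviation. MAD still changes noticeably, so it is less sensitive to outliers rather than immune to them.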