# Statistics for Data Science & Machine Learning

This notebook covers the basic statistics concepts needed for data science and machine learning. It is based on the video [Statistics for Data Science & Machine Learning](https://www.youtube.com/watch?v=tcusIOfI_GM&list=PLGLfVvz_LVvQy4mkmEvtFwZGg1S38MUmn) by [Derek Banas](https://www.youtube.com/@derekbanas).


## Basic concepts

### Definitions:

- _Population_ - all items or people of interest in what you are analyzing
- _Sample_ - subset of the _Population_ that we can analyze
- _Success_ - the result we are looking for in a sample (for example age, car owner or not, home owner or not)

### Notation:

- _M_ - _Successes_ in _Population_
- _x_ - _Successes_ in _Sample_
- _N_ - total _Population_
- _n_ - total _Sample_ from _Population_
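
The notation above can be tried out on a small simulated example. This is only a sketch: the population of 1000 items with 300 successes, the sample size of 50, and the seed are all made-up values chosen for illustration.

```python
import random

# Hypothetical population: True marks a "success" (for example, a home owner).
population = [True] * 300 + [False] * 700

N = len(population)   # total population
M = sum(population)   # successes in the population

rng = random.Random(42)                     # fixed seed so the run is repeatable
sample_items = rng.sample(population, 50)   # draw without replacement

n = len(sample_items)  # total sample
x = sum(sample_items)  # successes in the sample

print(f"Population proportion M/N = {M / N}")
print(f"Sample proportion x/n = {x / n}")
```

The sample proportion `x/n` is an estimate of the population proportion `M/N`, and it will usually differ slightly from one sample to the next.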


## Types of data

### Categorical

Data that sorts items into groups, such as the answer to a yes-or-no question, for example:

- Home owner
- Age 50+
- College graduate

### Numerical

This data can be:

- _Finite_ - has an ending value
- _Infinite_ - does not have an ending value

### Continuous

Data that can be broken down into infinitely smaller amounts, for example height, weight, time.

### Qualitative

This type of data can be:

- _Nominal_ - data that names something and has no order (for example fruits)
- _Ordinal_ - data that has an order (bad, ok, good, great)

### Quantitative

Mainly separated into:

- _Interval_ - numeric data with equal spacing between values but no true zero, so differences are meaningful while ratios are not (for example temperature in °C)
- _Ratio_ - numeric data with a true zero, so one value can be expressed as a multiple of another (for example, eight oranges and six lemons in a bowl give an orange-to-lemon ratio of 8/6)


## Charts and tools

### Pareto chart

A Pareto chart combines a bar chart and a line graph: the bars show the frequency of each category in descending order, and the line shows the cumulative percentage. It is commonly used to illustrate the Pareto principle (the 80/20 rule), which states that roughly 80% of the effects come from 20% of the causes. A simple explanation of the chart can be found in [this video](https://www.youtube.com/watch?v=ltBw6kwD3_o).
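
The numbers behind a Pareto chart can be sketched without any plotting library. The defect counts below are made up for illustration; the code only sorts the categories and accumulates their share of the total, which is what the bars and the line of the chart would show.

```python
# Hypothetical defect counts per cause (illustrative numbers only).
defects = {"Scratches": 45, "Dents": 30, "Misalignment": 15,
           "Discoloration": 7, "Other": 3}

total = sum(defects.values())

# Bars of a Pareto chart: causes sorted from most to least frequent.
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)

# Line of a Pareto chart: running share of the total, in percent.
pareto_rows = []
running = 0
for cause, count in ordered:
    running += count
    pareto_rows.append((cause, count, round(100 * running / total, 1)))

for cause, count, cumulative in pareto_rows:
    print(f"{cause:<14} {count:>3} {cumulative:>6}%")
```

With these made-up counts the two largest causes already account for 75% of all defects, which is the kind of concentration the chart is meant to expose.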

### Frequency distribution table

Lists the number of occurrences of each category.

| Category | Frequency |
| -------- | --------- |
| A        | 10        |
| B        | 15        |
| C        | 8         |
| D        | 12        |
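
A table like this can be built from raw observations with `collections.Counter`. The `observations` list is an assumption: it is constructed only so that the counts match the table above.

```python
from collections import Counter

# Hypothetical observations chosen to reproduce the counts in the table above.
observations = ["A"] * 10 + ["B"] * 15 + ["C"] * 8 + ["D"] * 12

# Counter maps each distinct category to its number of occurrences.
frequency = Counter(observations)

print("| Category | Frequency |")
print("| -------- | --------- |")
for category in sorted(frequency):
    print(f"| {category:<8} | {frequency[category]:<9} |")
```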

### Frequency histogram

A histogram is a graphical representation of the distribution of a numerical dataset. The value range is split into intervals (bins), and each bar shows how many values fall into its interval. A simple explanation of the histogram can be found [here](https://www.youtube.com/watch?v=8TV5ha9nqm0&ab_channel=HarvardOnline).
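
The binning step behind a histogram can be sketched in plain Python. The values and the bin width of 10 are arbitrary choices for illustration; a plotting library would normally draw the bars.

```python
# Illustrative values; the bin width of 10 is an arbitrary choice.
values = [10, 20, 15, 10, 15, 15, 10, 22, 32]
bin_width = 10

bins = {}
for value in values:
    start = (value // bin_width) * bin_width  # lower edge of the bin
    bins[start] = bins.get(start, 0) + 1      # count the value in its bin

# Text version of the histogram: one '#' per value in the bin.
for start in sorted(bins):
    end = start + bin_width - 1
    print(f"{start:>2}-{end:<2} | {'#' * bins[start]}")
```

Here six values fall into the 10-19 bin, two into 20-29, and one into 30-39.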


## Calculation of the basic terms

The following sections calculate basic statistics for the test sample defined below.

In [10]:
sample = [10, 20, 15, 10, 15, 15, 10, 22, 32]

### Mean (average)

The sum of all values divided by the number of values.

- _μ_ - _Mean_ of _Population_
- _x̄_ - _Mean_ of the _Sample_

In [11]:
def calculate_mean(sample):
    """Calculate the mean of a sample."""
    return sum(sample) / len(sample)

mean = calculate_mean(sample)
print(f"Mean of the sample equals to: {mean}")

Mean of the sample equals to: 16.555555555555557


### Median

The value in the middle of the sorted dataset.

In [12]:
def calculate_median(sample):
    """
    Calculate the median of a sample.
    Single middle value if the sample has an odd number of elements.
    Average of two middle values if the sample has an even number of elements.
    """
    # Sort a copy so the caller's list is left unchanged.
    ordered = sorted(sample)
    middle = len(ordered) // 2

    if len(ordered) % 2 == 0:
        median = (ordered[middle - 1] + ordered[middle]) / 2
    else:
        median = ordered[middle]

    return median

median = calculate_median(sample)
print(f"Median of the sample equals to: {median}")


Median of the sample equals to: 15


### Mode

The value (number) that appears the most often in the dataset.

In [13]:
def calculate_mode(sample):
    """
    Calculate the mode of a sample.
    If there are multiple modes, return all of them.
    Return None if no value occurs more than once.
    """
    max_occur = 0
    occurrences = {}

    for value in sample:
        # Count every occurrence, including the first one.
        occurrences[value] = occurrences.get(value, 0) + 1

        if occurrences[value] > max_occur:
            max_occur = occurrences[value]

    if max_occur <= 1:
        return None

    return [key for key, value in occurrences.items() if value == max_occur]


mode = calculate_mode(sample)
print(f"Mode of the sample equals to: {mode}")


Mode of the sample equals to: [10, 15]


### Variance

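Indicates how far the data is spread around the _Mean_. It is the average squared deviation from the mean; for a sample, the sum of squared deviations is divided by `n - 1` instead of `n`.

- _σ²_ - _Variance_ of the _Population_
- _S²_ - _Variance_ of the _Sample_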


In [14]:
def calculate_variance(sample, mean):
    """Calculate the variance of a sample."""
    sample_sum = 0

    for value in sample:
        sample_sum += (value - mean) ** 2

    return sample_sum / (len(sample) - 1)


variance = calculate_variance(sample, mean)
print(f"Variance of the sample equals to: {variance}")


Variance of the sample equals to: 52.02777777777777
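
The difference between the population and the sample formula is only the divisor. The sketch below reuses the test sample values; the function names are made up for this illustration and are not part of the notebook.

```python
def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def sample_variance(values):
    """Same sum of squared deviations, dividing by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((v - x_bar) ** 2 for v in values) / (len(values) - 1)


data = [10, 20, 15, 10, 15, 15, 10, 22, 32]
print(f"Divide by N:     {population_variance(data):.4f}")
print(f"Divide by n - 1: {sample_variance(data):.4f}")
```

Dividing by `n - 1` gives the slightly larger value (about 52.03 here, against about 46.25 when dividing by `N`), which compensates for the sample mean sitting closer to the sample values than the true population mean would.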


### Standard deviation

A measure of how dispersed the data is in relation to the mean.

Low standard deviation means data are clustered around the mean, and high standard deviation indicates data are more spread out.

- _σ_ - _Standard deviation_ of the _Population_
- _S_ - _Standard deviation_ of the _Sample_

In [15]:
from math import sqrt


def calculate_standard_deviation(variance):
    """Calculate the standard deviation of a sample."""
    return sqrt(variance)


standard_deviation = calculate_standard_deviation(variance)
print(f"Standard deviation of the sample equals to: {standard_deviation}")

Standard deviation of the sample equals to: 7.213028336127467
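
As a cross-check, the hand-written functions so far can be compared with Python's standard `statistics` module. This is only a sketch using the same test sample; `statistics.multimode` needs Python 3.8 or newer.

```python
import statistics

data = [10, 20, 15, 10, 15, 15, 10, 22, 32]

print(statistics.mean(data))       # arithmetic mean
print(statistics.median(data))     # middle value of the sorted data
print(statistics.multimode(data))  # every most common value, in order of first appearance
print(statistics.variance(data))   # sample variance, divides by n - 1
print(statistics.stdev(data))      # sample standard deviation
```

The results should agree with the values computed above, up to floating-point rounding. For population data, `statistics.pvariance` and `statistics.pstdev` divide by `N` instead.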


### Coefficient of variation

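The ratio of the standard deviation to the mean. Because it has no unit, it can be used to compare the spread of datasets measured on different scales, such as the same distances in miles and in kilometers.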

In [16]:
sample_miles = [3, 4, 4.5, 3.5]
sample_kilometers = [4.828, 6.437, 7.242, 5.632]

mean_miles = calculate_mean(sample_miles)
variance_miles = calculate_variance(sample_miles, mean_miles)
standard_deviation_miles = calculate_standard_deviation(variance_miles)

mean_kilometers = calculate_mean(sample_kilometers)
variance_kilometers = calculate_variance(sample_kilometers, mean_kilometers)
standard_deviation_kilometers = calculate_standard_deviation(
    variance_kilometers)


def calculate_coefficient_of_variation(standard_deviation, mean):
    """Calculate the coefficient of variation of a sample."""
    return standard_deviation / mean


coefficient_of_variation_miles = calculate_coefficient_of_variation(
    standard_deviation_miles, mean_miles)

print(
    f"Coefficient of variation of the sample in miles equals to: {round(coefficient_of_variation_miles, 4)}")


coefficient_of_variation_kilometers = calculate_coefficient_of_variation(
    standard_deviation_kilometers, mean_kilometers)

print(
    f"Coefficient of variation of the sample in kilometers equals to: {round(coefficient_of_variation_kilometers, 4)}")


Coefficient of variation of the sample in miles equals to: 0.1721
Coefficient of variation of the sample in kilometers equals to: 0.1721
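
Both results are equal because multiplying every value by a constant scales the standard deviation and the mean by the same factor. A minimal sketch of that property, using the `statistics` module and a helper name invented for this example:

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    return statistics.stdev(values) / statistics.mean(values)


miles = [3, 4, 4.5, 3.5]
kilometers = [m * 1.609344 for m in miles]  # one mile is 1.609344 km

# The standard deviations differ because the units differ ...
print(round(statistics.stdev(miles), 4), round(statistics.stdev(kilometers), 4))
# ... but the coefficients of variation are the same.
print(round(coefficient_of_variation(miles), 4))
print(round(coefficient_of_variation(kilometers), 4))
```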


### Covariance

Tells us whether two variables move in the same direction.

- if `COV > 0` the variables tend to move together
- if `COV < 0` the variables tend to move in opposite directions
- if `COV = 0` the variables have no linear relationship (this alone does not prove that they are independent)
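
A covariance of zero only rules out a linear relationship. The sketch below uses made-up data in which `y` is completely determined by `x`, yet the covariance is still zero; the helper `sample_covariance` is written just for this example.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # y depends entirely on x, but not linearly

print(sample_covariance(xs, ys))
```

The positive and negative products cancel out because the relationship is symmetric around the mean of `x`.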

In [17]:
sample_market_cap = [1532, 1488, 1343, 928, 615]
sample_earnings = [58, 35, 75, 41, 17]

mean_market_cap = calculate_mean(sample_market_cap)
mean_earnings = calculate_mean(sample_earnings)

print(mean_market_cap)
print(mean_earnings)


def calculate_covariance(sample_x, sample_y, mean_x, mean_y):
    """Calculate the covariance of two samples."""
    cov = 0

    for value_x, value_y in zip(sample_x, sample_y):
        cov += (value_x - mean_x) * (value_y - mean_y)

    # Divide by n - 1 (sample covariance); the parentheses are required.
    return cov / (len(sample_x) - 1)


covariance_mc_e = calculate_covariance(
    sample_market_cap, sample_earnings, mean_market_cap, mean_earnings)
print(
    f"Covariance of the samples equals to: {round(covariance_mc_e, 2)}, that means they are positively correlated.")


1181.2
45.2
Covariance of the samples equals to: 5803.2, that means they are positively correlated.


### Correlation coefficient

Rescales the _Covariance_ by both standard deviations so that the strength of a relationship can be compared across datasets.

- is a value between `-1` and `1`
- `0` means no linear relationship
- `1` means perfect positive correlation
- `-1` means perfect inverse correlation

In [18]:
variance_market_cap = calculate_variance(sample_market_cap, mean_market_cap)
standard_deviation_market_cap = calculate_standard_deviation(
    variance_market_cap)

variance_earnings = calculate_variance(sample_earnings, mean_earnings)
standard_deviation_earnings = calculate_standard_deviation(variance_earnings)


def calculate_correlation_coefficient(covariance, standard_deviation_x, standard_deviation_y):
    """Calculate the correlation coefficient of two samples."""
    return covariance / (standard_deviation_x * standard_deviation_y)


correlation_coefficient_mc_e = calculate_correlation_coefficient(
    covariance_mc_e, standard_deviation_market_cap, standard_deviation_earnings)

print(
    f"Correlation coefficient of the samples equals to: {round(correlation_coefficient_mc_e, 4)}, that means they are positively correlated.")


Correlation coefficient of the samples equals to: 0.6601, that means they are positively correlated.
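
The reason for rescaling the covariance can be sketched with the same market cap and earnings figures: expressing one variable in a different unit changes the covariance but leaves the correlation coefficient untouched. The helper functions and the factor of 1000 are assumptions made for this illustration.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance, dividing by n - 1."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Covariance divided by the product of both sample standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


market_cap = [1532, 1488, 1343, 928, 615]
earnings = [58, 35, 75, 41, 17]
market_cap_scaled = [value * 1000 for value in market_cap]  # same data, smaller unit

# Covariance grows with the unit change ...
print(covariance(market_cap, earnings), covariance(market_cap_scaled, earnings))
# ... while the correlation coefficient stays the same.
print(correlation(market_cap, earnings), correlation(market_cap_scaled, earnings))
```

Note that `covariance(xs, xs)` is simply the sample variance of `xs`, so the denominator is the product of the two standard deviations.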


## Distributions

Shows the basics of data distributions and errors.

### Probability distribution

Describes the _Probability_ of each possible outcome.

- _Coin flip_: each side has probability `P = 1/2 = 0.5`
- _Dice roll_: each face has probability `P = 1/6 ≈ 0.167`
- The probabilities of all outcomes always sum to `1`

Probability distribution of the sum of two dice rolls:

| Sum | Probability |
|-----|------------|
|  2  |    1/36    |
|  3  |    2/36    |
|  4  |    3/36    |
|  5  |    4/36    |
|  6  |    5/36    |
|  7  |    6/36    |
|  8  |    5/36    |
|  9  |    4/36    |
| 10  |    3/36    |
| 11  |    2/36    |
| 12  |    1/36    |

Each probability is the number of ways to achieve that sum divided by the 36 possible outcomes of two dice (for example, there are 6 ways to get a sum of 7, but only one way to get a sum of 2).
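
The table can be reproduced by enumerating all outcomes of two dice. This is a small sketch assuming two fair six-sided dice; it counts the ways to reach each sum and checks that the probabilities add up to 1.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two six-sided dice.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Probability of each sum as an exact fraction.
distribution = {total: Fraction(count, 36) for total, count in sorted(ways.items())}

for total in distribution:
    print(f"{total:>2}: {ways[total]}/36")

# The probabilities of all possible sums add up to exactly 1.
print(sum(distribution.values()))
```

`Fraction` keeps the arithmetic exact, so the final sum is exactly 1 rather than a floating-point approximation.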