# Statistics for Data Science & Machine Learning

This notebook covers the basic statistical concepts needed for Data Science and Machine Learning. It is based on the video [Statistics for Data Science & Machine Learning](https://www.youtube.com/watch?v=tcusIOfI_GM&list=PLGLfVvz_LVvQy4mkmEvtFwZGg1S38MUmn) by [Derek Banas](https://www.youtube.com/@derekbanas).


## Basic concepts

### Definitions:

- _Population_ - all items or people of interest in what you are analyzing
- _Sample_ - subset of the _Population_ that we can analyze
- _Success_ - the result we are looking for in a sample (for example: a certain age group, car owner or not, home owner or not)

### Notation:

- _M_ - _Successes_ in _Population_
- _x_ - _Successes_ in _Sample_
- _N_ - total _Population_
- _n_ - total _Sample_ from _Population_
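
To make the notation concrete, here is a small sketch with a hypothetical population of ten people, where `True` marks a success (for example, a home owner). The data and the variable names are illustrative assumptions, and the sample is simply the first four entries so the result is reproducible; a real sample would normally be drawn at random.

```python
# Hypothetical population: True marks a "success" (e.g. a home owner).
population_flags = [True, False, True, True, False,
                    False, True, False, True, False]

# Hypothetical sample: the first four items (fixed for reproducibility).
sample_flags = population_flags[:4]

N = len(population_flags)   # total Population
M = sum(population_flags)   # Successes in Population
n = len(sample_flags)       # total Sample
x = sum(sample_flags)       # Successes in Sample

population_proportion = M / N
sample_proportion = x / n

print(f"N={N}, M={M}, M/N={population_proportion}")
print(f"n={n}, x={x}, x/n={sample_proportion}")
```

The sample proportion (x/n) is what we can measure, and it serves as an estimate of the population proportion (M/N), which is usually unknown.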


## Types of data

### Categorical

Data that places each item in a category rather than measuring it. The simplest case is the answer to a yes-or-no question, for example:

- Home owner
- Age 50+
- College graduate

### Numerical

This data can be:

- _Finite_ - has an ending value
- _Infinite_ - does not have an ending value

### Continuous

Data that can be broken down into infinitely smaller amounts, for example height, weight, or time.

### Qualitative

This type of data can be:

- _Nominal_ - labels that name something and have no inherent order (for example, fruits)
- _Ordinal_ - labels that do have an order (bad, ok, good, great)
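
The distinction shows up as soon as the data is sorted: ordinal labels need an explicit ranking, because alphabetical order has nothing to do with their meaning. A small sketch, assuming a hypothetical list of ratings on the bad/ok/good/great scale:

```python
# Hypothetical ratings on an ordinal scale.
ratings = ["good", "bad", "great", "ok", "good"]

# The order of the scale has to be stated explicitly.
scale = ["bad", "ok", "good", "great"]
rank = {label: position for position, label in enumerate(scale)}

alphabetical = sorted(ratings)                              # order is meaningless
by_rank = sorted(ratings, key=lambda label: rank[label])    # order follows the scale

print(alphabetical)
print(by_rank)
```

For nominal data such as fruit names there is no such scale, so any ordering is only a display choice.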

### Quantitative

Mainly separated into:

- _Interval_ - numerical data where the difference between values is meaningful but there is no true zero point (for example, temperature in degrees Celsius: 20° is not "twice as hot" as 10°)
- _Ratio_ - numerical data with a true zero, so one value can be expressed as a multiple of another. For example, if there are eight oranges and six lemons in a bowl of fruit, then the ratio of oranges to lemons is eight to six (8/6)


## Charts and tools

### Pareto chart

A Pareto chart combines a bar chart and a line graph: the bars show each category's count, sorted from largest to smallest, and the line shows the cumulative percentage of the total. It is commonly used to illustrate the 80/20 rule, also known as the Pareto principle, which states that roughly 80% of the effects come from 20% of the causes. A simple explanation of the chart can be found in [this video](https://www.youtube.com/watch?v=ltBw6kwD3_o).
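
A minimal sketch of the numbers behind a Pareto chart, assuming a hypothetical defect log (the data and the names `defects` and `pareto_rows` are made up for illustration). The counts sorted in descending order would become the bars, and the running cumulative percentage would become the line:

```python
from collections import Counter

# Hypothetical defect log; the category names are illustrative only.
defects = ["scratch"] * 50 + ["dent"] * 30 + ["stain"] * 15 + ["crack"] * 5

# most_common() returns (category, count) pairs sorted by count, descending.
counts = Counter(defects).most_common()
total = sum(count for _, count in counts)

cumulative = 0
pareto_rows = []
for category, count in counts:
    cumulative += count
    cumulative_pct = round(100 * cumulative / total, 1)
    pareto_rows.append((category, count, cumulative_pct))

for category, count, cumulative_pct in pareto_rows:
    print(f"{category:<8} {count:>3} {cumulative_pct:>6}%")
```

In this made-up data the first two of four categories already account for 80% of all defects, which is the kind of pattern the chart is meant to expose.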

### Frequency distribution table

Lists the number of occurrences of each category.

| Category | Frequency |
| -------- | --------- |
| A        | 10        |
| B        | 15        |
| C        | 8         |
| D        | 12        |
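
A table like this can be built from raw observations with `collections.Counter`. The sketch below assumes a hypothetical list of observations constructed to match the counts in the table above:

```python
from collections import Counter

# Hypothetical raw observations that reproduce the table above.
observations = ["A"] * 10 + ["B"] * 15 + ["C"] * 8 + ["D"] * 12

# Counter maps each category to the number of times it occurs.
frequency = Counter(observations)

print("| Category | Frequency |")
print("| -------- | --------- |")
for category in sorted(frequency):
    print(f"| {category:<8} | {frequency[category]:<9} |")
```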

### Frequency histogram

A histogram is a graphical representation of the distribution of a numerical dataset. The range of values is split into intervals (bins), and the height of each bar shows how many values fall into that interval. A simple explanation of the histogram can be found [here](https://www.youtube.com/watch?v=8TV5ha9nqm0&ab_channel=HarvardOnline).
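
The counting step behind a histogram can be sketched without any plotting library. The example below assumes a hypothetical list of values and a bin width of 10 (both chosen only for illustration) and prints a simple text version of the bars:

```python
from collections import Counter

# Hypothetical measurements and an assumed bin width.
values = [3, 7, 12, 18, 21, 25, 29, 33, 38, 44]
bin_width = 10

# Map each value to the lower edge of its bin, then count values per bin.
histogram = Counter((value // bin_width) * bin_width for value in values)

for start in sorted(histogram):
    # Each bin is the half-open interval [start, start + bin_width).
    bar = "#" * histogram[start]
    print(f"[{start:>2}, {start + bin_width:>2}) {bar}")
```

Changing `bin_width` changes the shape of the histogram, so the choice of interval size is part of how the data is presented.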


## Calculation of the basic terms

We are going to calculate the mean, median, and mode for the test sample below.

In [9]:
sample = [10, 10, 20, 15, 125, 100]

### Mean (average)

The sum of all values divided by the number of values.

- _μ_ - _Mean_ of _Population_
- _x̄_ - _Mean_ of the _Sample_

In [10]:
def calculate_mean(sample):
    """Calculate the mean of a sample."""
    return sum(sample) / len(sample)

mean = calculate_mean(sample)
print(f"Mean of the sample equals to: {mean}")

Mean of the sample equals to: 46.666666666666664


### Median

The value in the middle of the dataset once it has been sorted.

In [11]:
def calculate_median(sample):
    """
    Calculate the median of a sample.
    Single middle value if the sample has an odd number of elements.
    Average of two middle values if the sample has an even number of elements.
    """
    # Sort a copy so the caller's list is not modified in place.
    ordered = sorted(sample)
    middle = len(ordered) // 2

    if len(ordered) % 2 == 0:
        median = (ordered[middle - 1] + ordered[middle]) / 2
    else:
        median = ordered[middle]

    return median

median = calculate_median(sample)
print(f"Median of the sample equals to: {median}")


Median of the sample equals to: 17.5
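
As a cross-check, Python's standard `statistics` module gives the same mean and median for this sample. The sketch below also drops the two largest values (a cutoff of 100 chosen only for illustration) to show that the mean reacts much more strongly to them than the median does:

```python
import statistics

sample = [10, 10, 20, 15, 125, 100]

mean_value = statistics.mean(sample)
median_value = statistics.median(sample)
print(f"mean={mean_value}, median={median_value}")

# Illustrative cutoff: keep only the values below 100.
without_large = [value for value in sample if value < 100]

trimmed_mean = statistics.mean(without_large)
trimmed_median = statistics.median(without_large)
print(f"mean={trimmed_mean}, median={trimmed_median}")
```

Removing the two large values moves the mean from about 46.7 to 13.75, while the median only moves from 17.5 to 12.5.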


### Mode

The value (number) that appears the most often in the dataset.

In [12]:
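from collections import Counter

def calculate_mode(sample):
    """Calculate the mode of a sample (the first one encountered on a tie)."""
    # Counter tallies every value in one pass instead of calling count() per element.
    return Counter(sample).most_common(1)[0][0]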


mode = calculate_mode(sample)
print(f"Mode of the sample equals to: {mode}")

Mode of the sample equals to: 10
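
A dataset can have more than one mode when several values tie for the highest count. One way to handle that, assuming Python 3.8 or later, is `statistics.multimode`, which returns every tied value; the second list below is a hypothetical example with a tie:

```python
import statistics

sample = [10, 10, 20, 15, 125, 100]
tied_sample = [1, 1, 2, 2, 3]  # hypothetical data with two modes

# multimode returns a list of all most frequent values,
# in the order they first appear in the data.
sample_modes = statistics.multimode(sample)
tied_modes = statistics.multimode(tied_sample)

print(sample_modes)
print(tied_modes)
```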
