# Descriptive statistics – Min, Max, Median etc.

Descriptive statistics give a first overview of the data. They include values such as the range, median, and mean, and they help to reveal the central tendency, identify outliers, and understand the overall behavior of the data.

For this purpose, the `pandas` library is imported.

In [1]:
from pathlib import Path
import pandas as pd
base_path = Path.cwd().parents[0]
INPUT = base_path / "00_data"

The data of interest is loaded into the environment.

In [None]:
data_path = INPUT / "clipped_layer.csv"
data = pd.read_csv(data_path)

Then the `describe` function of `pandas` is used to create a summary.

`````{admonition}  The describe function
:class: note
By default, the `describe` function creates a summary for the numerical columns.
`````

For each numerical column, the summary generated by the `describe` function contains:
- **`count`**: number of non-missing values
- **`mean`**: average of the values
- **`std`**: spread of the data (standard deviation)
- **`min`**: the minimum value
- **`25%`**: lower quartile
- **`50%`**: median
- **`75%`**: upper quartile
- **`max`**: the maximum value

In [None]:
num_statistics = data.describe()
print(num_statistics)
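
The summary returned by `describe` is itself a `DataFrame`, so single statistics can be read from it by row label and column name. The following is a minimal sketch on a hypothetical toy table; the column names `area` and `label` are illustrative assumptions and do not come from `clipped_layer.csv`.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "area": [10.0, 20.0, 30.0, 40.0],
    "label": ["a", "b", "a", "c"],
})

summary = toy.describe()

# By default only the numerical column appears in the summary.
print(list(summary.columns))

# Individual statistics are addressed by row label and column name.
median_area = summary.loc["50%", "area"]
mean_area = summary.loc["mean", "area"]
print(median_area, mean_area)
```

Reading values this way is convenient when a statistic is needed later in the analysis, for example as a threshold.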

`````{admonition} describe(include='all')
:class: tip
To obtain the summary statistics for all columns, including **both numerical and categorical** data, the `describe` function should be called with the `include='all'` parameter.
`````

For categorical columns, `describe(include='all')` adds the following statistics:
- **`unique`**: number of distinct values
- **`top`**: most frequent value (mode)
- **`freq`**: frequency of the top value

In [None]:
statistics = data.describe(include='all')
print(statistics)
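
To illustrate how the combined summary is filled in, the sketch below applies `describe(include='all')` to a hypothetical toy table (the columns `area` and `label` are assumptions made for this example). Statistics that do not apply to a column are reported as `NaN`.

```python
import pandas as pd

# Hypothetical toy data with one numerical and one categorical column.
toy = pd.DataFrame({
    "area": [10.0, 20.0, 30.0, 40.0],
    "label": ["a", "b", "a", "c"],
})

full = toy.describe(include="all")

# Categorical statistics for the text column.
print(full.loc[["unique", "top", "freq"], "label"])

# Numerical statistics are not defined for the text column and show up as NaN.
print(full.loc["mean", "label"])
```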

A plot often gives a clearer visual summary of the data in each column than a table of numbers.

The `matplotlib` library is used here to display the plots created through the `plot` method of `pandas`.

In [None]:
import matplotlib.pyplot as plt

For example, a box plot is a quick way to review the maximum, minimum, median, and quartiles of a dataset. Only the required column has to be selected and the desired plot type passed through the `kind` argument.

In [None]:
data['CLC_st1'].plot(kind='box')
plt.show()
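
The points that a box plot draws separately as outliers can also be computed directly. The sketch below assumes the common convention, which is also the `matplotlib` default, that whiskers reach at most 1.5 times the interquartile range beyond the quartiles. The values are made up for illustration.

```python
import pandas as pd

# Hypothetical values with one obviously extreme observation.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1 = values.quantile(0.25)  # lower quartile
q3 = values.quantile(0.75)  # upper quartile
iqr = q3 - q1               # interquartile range

# Values beyond these bounds would be drawn as individual points.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```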


A histogram shows the distribution of continuous data by counting how many values fall into each of a series of consecutive intervals (bins).

In [None]:
data['CLC_st1'].plot(kind='hist')
plt.show()
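
The counts behind a histogram can be inspected without drawing it. The following sketch uses `numpy.histogram` on made-up values with three equal-width bins; the number of bins is an arbitrary choice for this example (the `hist` plot of `pandas` uses 10 bins unless `bins` is set).

```python
import numpy as np
import pandas as pd

# Hypothetical values; most are small, one is large.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 10])

# Split the range from the minimum to the maximum into 3 equal-width bins.
counts, edges = np.histogram(values, bins=3)

# Each count belongs to the interval between two neighbouring edges.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```

Changing the number of bins changes how much detail the histogram shows, so it is worth trying a few values.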

If the relationship between two variables in a dataset is of interest, a scatter plot helps to visualize and understand it.

In [None]:
data.plot(kind='scatter', x='CLC_st1', y='Biotpkt2018', alpha=0.2, color='purple')
plt.show()


`````{admonition} More Plots?
:class: tip
To find the most suitable plot for the needs of a project, the link below offers a wide range of plot types with different customization options.
https://matplotlib.org/stable/plot_types/index.html
`````