![logo](../../images/logo_diive1_128px.png)

<span style='font-size:40px; display:block;'>
<b>
    Histogram
</b>
</span>

---
**Notebook version**: `1` (16 Mar 2024)  
**Author**: Lukas Hörtnagl (holukas@ethz.ch)

</br>

# **Description**

- Calculate a histogram from an input series.
- This example calculates a histogram of the CO2 time lags (in seconds) found in relation to wind measurements.
- The histogram can be calculated
    - with a specific number of bins.
    - with a specific number of bins, but excluding a defined number of fringe bins at the start and the end of the histogram.
    - with separate bins for each unique value in the data.
    - with separate bins for each unique value in the data, but excluding a defined number of fringe bins at the start and the end of the histogram.
- The optional exclusion of fringe bins was implemented because some histograms show an undesired accumulation of values at the start or the end of the histogram, resulting in distribution peaks that could mask the "true" distribution peak.
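
The effect of excluding fringe bins can be illustrated with plain NumPy. The following is a minimal sketch on synthetic data, not diive's internal implementation; the helper `trimmed_histogram` and its parameters are hypothetical names chosen for this illustration.

```python
import numpy as np


def trimmed_histogram(values, n_bins, ignore_start=0, ignore_end=0):
    """Hypothetical helper: histogram with fringe bins dropped at both ends."""
    counts, edges = np.histogram(values, bins=n_bins)
    starts = edges[:-1]  # left (inclusive) edge of each bin
    stop = len(counts) - ignore_end
    return starts[ignore_start:stop], counts[ignore_start:stop]


# Synthetic lags: a "true" peak at 1.3 s plus an accumulation at the upper limit of 5 s
values = np.concatenate([
    np.full(40, 1.3),  # "true" lag
    np.full(60, 5.0),  # values piled up at the end of the range
    np.array([0.0]),   # one value at the lower limit, so the range is 0..5 s
])

starts_all, counts_all = trimmed_histogram(values, n_bins=10)
starts_trim, counts_trim = trimmed_histogram(values, n_bins=10, ignore_start=1, ignore_end=1)

peak_all = starts_all[np.argmax(counts_all)]     # dominated by the last (fringe) bin
peak_trim = starts_trim[np.argmax(counts_trim)]  # peak after dropping one bin per side
print(peak_all, peak_trim)
```

With all bins included, the bin holding the accumulated values at 5 s has the highest count; once one bin is dropped at each end, the bin containing 1.3 s becomes the peak.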

</br>

# **Imports**

In [1]:
import importlib.metadata
import warnings
from datetime import datetime

from diive.configs.exampledata import load_exampledata_EDDYPRO_FLUXNET_CSV_30MIN  # Example data
from diive.pkgs.analyses.histogram import Histogram

warnings.filterwarnings("ignore")
version_diive = importlib.metadata.version("diive")
print(f"diive version: v{version_diive}")

diive version: v0.79.0


</br>

# **Load example data**

In [2]:
data_df, metadata_df = load_exampledata_EDDYPRO_FLUXNET_CSV_30MIN()
data_df.head()

Reading file exampledata_EDDYPRO-FLUXNET-CSV-30MIN_CH-AWS_2022.07_FR-20220127-164245_eddypro_fluxnet_2022-01-28T112538_adv.csv ...


TIMESTAMP_MIDDLE,TIMESTAMP_START,DOY_START,DOY_END,FILENAME_HF,SW_IN_POT,NIGHT,EXPECT_NR,FILE_NR,CUSTOM_FILTER_NR,WD_FILTER_NR,SONIC_NR,T_SONIC_NR,CO2_NR,H2O_NR,CH4_NR,NONE_NR,TAU_NR,H_NR,FC_NR,LE_NR,FCH4_NR,FNONE_NR,TAU,H,LE,...,BADM_INSTPAIR_EASTWARD_SEP_GA_NONE,BADM_INSTPAIR_HEIGHT_SEP_GA_NONE,BADM_INST_GA_CP_TUBE_LENGTH_GA_NONE,BADM_INST_GA_CP_TUBE_IN_DIAM_GA_NONE,BADM_INST_GA_CP_TUBE_FLOW_RATE_GA_NONE,HPATH_GA_NONE,VPATH_GA_NONE,RESPONSE_TIME_GA_NONE,NUM_CUSTOM_VARS,CUSTOM_DATA_SIZE_IRGA72_MEAN,CUSTOM_STATUS_CODE_IRGA72_MEAN,CUSTOM_GA_DIAG_CODE_IRGA72_MEAN,CUSTOM_SIGNAL_STRENGTH_IRGA72_MEAN,CUSTOM_H2O_MEAN,CUSTOM_CO2_MEAN,CUSTOM_AIR_P_MEAN,CUSTOM_COOLER_V_MEAN,CUSTOM_FLOWRATE_MEAN,NUM_BIOMET_VARS,LW_IN_1_1_1,PA_1_1_1,PPFD_IN_1_1_1,RH_1_1_1,SW_IN_1_1_1,TA_1_1_1
2021-07-01 00:15:00,202107000000.0,182.0,182.021,,0.0,1.0,36000.0,36000.0,36000.0,36000.0,36000.0,36000.0,35988.0,35988.0,,,36000.0,36000.0,35988.0,35988.0,,,-0.012152,-4.93653,6.69808,...,,,,,,,,,9.0,26.0,0.0,8191.0,100.05,340.582,14.2538,80198.2,1.91066,0.000259,6.0,329.076,80.1982,0.0,100.0,0.0,5.5588
2021-07-01 00:45:00,202107000000.0,182.021,182.042,,0.0,1.0,36000.0,36000.0,36000.0,36000.0,36000.0,36000.0,35988.0,35988.0,,,36000.0,36000.0,35988.0,35988.0,,,-0.01773,4.44042,-0.171644,...,,,,,,,,,9.0,26.0,0.0,8191.0,100.05,344.452,14.1511,80203.3,1.91007,0.000259,6.0,340.771,80.2033,0.0,100.0,0.0,5.59079
2021-07-01 01:15:00,202107000000.0,182.042,182.062,,0.0,1.0,36000.0,36000.0,36000.0,36000.0,36000.0,36000.0,35988.0,35988.0,,,36000.0,36000.0,35988.0,35988.0,,,-0.034328,3.4912,-1.03688,...,,,,,,,,,9.0,26.0,0.0,8191.0,100.05,343.712,14.1364,80210.6,1.90857,0.000259,6.0,341.942,80.2106,0.0,99.862,0.0,5.5835
2021-07-01 01:45:00,202107000000.0,182.062,182.083,,0.0,1.0,36000.0,36000.0,36000.0,36000.0,36000.0,36000.0,35988.0,35988.0,,,36000.0,36000.0,35988.0,35988.0,,,-0.018509,4.38549,-0.313793,...,,,,,,,,,9.0,26.0,0.0,8191.0,100.05,342.139,14.1158,80206.5,1.90679,0.000259,6.0,349.003,80.2064,0.0,99.3662,0.0,5.5514
2021-07-01 02:15:00,202107000000.0,182.083,182.104,,0.0,1.0,36000.0,36000.0,36000.0,36000.0,36000.0,36000.0,35988.0,35988.0,,,36000.0,36000.0,35988.0,35988.0,,,-0.022376,4.95141,1.61602,...,,,,,,,,,9.0,26.0,0.0,8191.0,100.05,343.733,14.126,80192.1,1.90773,0.000259,6.0,346.874,80.1922,0.0,99.938,0.0,5.59066


In [3]:
series = data_df['CO2_TLAG_ACTUAL'].copy()
print("Time series of found CO2 time lags in seconds:")
series.head()

Time series of found CO2 time lags in seconds:


TIMESTAMP_MIDDLE
2021-07-01 00:15:00    1.30
2021-07-01 00:45:00    1.25
2021-07-01 01:15:00    1.30
2021-07-01 01:45:00    1.20
2021-07-01 02:15:00    1.30
Freq: 30min, Name: CO2_TLAG_ACTUAL, dtype: float64

</br>

# **Calculate histogram**

</br>

## (1) Specific number of bins
- A simple histogram with 10 bins.

In [4]:
hist = Histogram(
    s=series,
    method='n_bins',
    n_bins=10,
    ignore_fringe_bins=None
)

In [5]:
hist.results

,BIN_START_INCL,COUNTS
0,0.0,10
1,0.5,2
2,1.0,746
3,1.5,3
4,2.0,1
5,2.5,0
6,3.0,4
7,3.5,4
8,4.0,5
9,4.5,36


In [6]:
# Show the five bins with highest counts in decreasing order
hist.peakbins

[1.0, 4.5, 0.0, 4.0, 3.0]

- The distribution peaks in the bin starting at 1s (746 counts).
- The next-highest counts were found in the bins starting at 4.5s, 0s, 4s and 3s.
- Each bin is 0.5s wide, which means that the peak bin at 1s covers all values between 1s (inclusive) and 1.5s (exclusive).
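
The left-inclusive bin convention can be checked with NumPy. This is a minimal sketch assuming ten equal-width bins over 0 s to 5 s, which matches the 0.5 s bin starts listed above; the three sample values are made up for illustration, and whether diive computes its bins this way internally is not confirmed here.

```python
import numpy as np

# Assumption: 10 equal-width bins spanning 0..5 s, i.e. bin starts 0.0, 0.5, ..., 4.5
edges = np.linspace(0.0, 5.0, 11)

# Made-up lags: two fall into the bin starting at 1.0 s, one into the bin starting at 1.5 s
sample = np.array([1.0, 1.45, 1.5])
counts, _ = np.histogram(sample, bins=edges)

bin_width = edges[1] - edges[0]
for start, count in zip(edges[:-1], counts):
    if count > 0:
        print(f"[{start:.1f}, {start + bin_width:.1f}): {count}")
```

A value of exactly 1.5 s is counted in the bin starting at 1.5 s, not in the bin starting at 1.0 s. Note that in `numpy.histogram` the last bin is the exception: it also includes its right edge.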

</br>

## (2) Specific number of bins, but excluding fringe bins
- Histogram with 10 bins, but the first bin and the last two bins are ignored.

In [7]:
hist = Histogram(
    s=series,
    method='n_bins',
    n_bins=10,
    ignore_fringe_bins=[1, 2]
)

In [8]:
hist.results

,BIN_START_INCL,COUNTS
0,0.5,2
1,1.0,746
2,1.5,3
3,2.0,1
4,2.5,0
5,3.0,4
6,3.5,4


In [9]:
hist.peakbins

[1.0, 3.0, 3.5, 1.5, 0.5]

- The distribution peaks in the bin starting at 1s (746 counts).
- The next-highest counts were found in the bins starting at 3s, 3.5s, 1.5s and 0.5s.
- As defined, the fringe bins at 0s, 4s and 4.5s were ignored.
- Each bin is 0.5s wide, which means that the peak bin at 1s covers all values between 1s (inclusive) and 1.5s (exclusive).
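
The effect of `ignore_fringe_bins=[1, 2]` on the ranking can be retraced from the counts of the 10-bin histogram in example (1). The sketch below uses plain Python on those counts; the helper `top_bins` is a hypothetical name and not part of diive, and it assumes that bins with equal counts are ranked in order of their bin start.

```python
# Bin starts (s) and counts copied from the results of example (1)
bin_starts = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
counts = [10, 2, 746, 3, 1, 0, 4, 4, 5, 36]


def top_bins(starts, counts, n_start=0, n_end=0, n_peaks=5):
    """Hypothetical helper: rank bins by count after dropping fringe bins."""
    stop = len(starts) - n_end
    kept = list(zip(starts[n_start:stop], counts[n_start:stop]))
    # sorted() is stable, so bins with equal counts keep their original order
    ranked = sorted(kept, key=lambda pair: pair[1], reverse=True)
    return [start for start, _ in ranked[:n_peaks]]


print(top_bins(bin_starts, counts))                    # [1.0, 4.5, 0.0, 4.0, 3.0]
print(top_bins(bin_starts, counts, n_start=1, n_end=2))  # [1.0, 3.0, 3.5, 1.5, 0.5]
```

Both lists agree with the `peakbins` output of examples (1) and (2): dropping the first bin and the last two bins removes 0s, 4s and 4.5s from the ranking, so lower-count bins move up.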

</br>

## (3) Separate bin for each unique value
- Histogram with many bins, whereby bins correspond to unique values found in the dataset.

In [10]:
hist = Histogram(
    s=series,
    method='uniques',
    ignore_fringe_bins=None
)

In [11]:
hist.results

,BIN_START_INCL,COUNTS
0,0.0,8
1,0.05,1
2,0.25,1
3,0.85,1
4,0.95,1
5,1.05,3
6,1.1,4
7,1.15,8
8,1.2,28
9,1.25,245


In [12]:
hist.peakbins

[1.3, 1.25, 1.35, 1.2, 4.9]

- The distribution peaks in the bin starting at 1.3s.
- The next-highest counts were found in the bins starting at 1.25s, 1.35s, 1.2s and 4.9s.
- Around the peak, neighboring unique values are 0.05s apart, which means that the peak bin at 1.3s covers all values between 1.3s (inclusive) and 1.35s (exclusive).
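
The idea behind `method='uniques'` can be sketched with NumPy: every distinct value gets its own count. The values below are the first five records of the series shown earlier; this illustrates the concept only and is not diive's internal code.

```python
import numpy as np

# First five CO2 time lags (s) from the example series
lags = np.array([1.30, 1.25, 1.30, 1.20, 1.30])

# One "bin" per distinct value, returned in ascending order
unique_values, counts = np.unique(lags, return_counts=True)
for value, count in zip(unique_values, counts):
    print(f"{value:.2f} s: {count}")

# Spacing between neighboring unique values, rounded to suppress float noise
spacing = np.round(np.diff(unique_values), 2)
print(spacing)
```

In this small sample the unique values are 0.05 s apart. The spacing is not guaranteed to be uniform over the whole series: in the results table above, the bin starts jump from 0.05s to 0.25s.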

</br>

## (4) Separate bin for each unique value, but excluding fringe bins
- Histogram with many bins, whereby bins correspond to unique values found in the dataset. The first 8 and the last 20 bins are ignored.

In [13]:
hist = Histogram(
    s=series,
    method='uniques',
    ignore_fringe_bins=[8, 20]
)

In [14]:
hist.results

,BIN_START_INCL,COUNTS
0,1.2,28
1,1.25,245
2,1.3,398
3,1.35,55
4,1.4,5
5,1.5,1
6,1.75,1


In [15]:
hist.peakbins

[1.3, 1.25, 1.35, 1.2, 1.4]

- The distribution peaks in the bin starting at 1.3s (398 counts).
- The next-highest counts were found in the bins starting at 1.25s, 1.35s, 1.2s and 1.4s.
- Many fringe bins were ignored, including the bin at 4.9s with 23 counts.
- Around the peak, neighboring unique values are 0.05s apart, which means that the peak bin at 1.3s covers all values between 1.3s (inclusive) and 1.35s (exclusive).

</br>

# **End of notebook**

In [16]:
dt_string = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"Finished {dt_string}")

Finished 2024-08-22 16:08:13
