# New idea: histogram similarity

For each time of day, the histogram similarity builds a histogram of all consumption measurements of a profile at that time of day.  
To compare two profiles, the time-of-day histograms of the two profiles are compared one by one using the Wasserstein distance (implemented in [scipy.stats.wasserstein_distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html)).

## Wasserstein distance

This distance is also known as the earth mover's distance, since it can be
seen as the minimum amount of "work" required to transform one distribution $u$
into another distribution $v$, where "work" is measured as the amount of
distribution weight that must be moved, multiplied by the distance it has to be moved.

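**Interestingly**, `wasserstein_distance` also accepts weights (`u_weights` and `v_weights`) for the values it compares. For a histogram, the bin positions can be passed as the values and the bin counts as the weights. We could also scale those weights to put extra emphasis on some bins, for example high peaks, if we want.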


In [None]:
from scipy.stats import wasserstein_distance
wasserstein_distance([1,2,3], [4,5,6])
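
The call above treats its two arguments as raw samples. For binned data, a histogram can instead be passed as bin centers (values) plus counts (weights). The sketch below uses made-up bin edges and counts, not values from the dataset, and contrasts this with passing the counts themselves as values.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Hypothetical shared bin edges (in kW) and counts for two profiles.
bin_edges = np.array([0.0, 1.0, 2.0, 3.0])
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2  # 0.5, 1.5, 2.5
counts_a = np.array([10, 0, 0])  # all mass in the lowest bin
counts_b = np.array([0, 0, 10])  # all mass in the highest bin

# Bin centers are the values, counts are the weights.
d_weighted = wasserstein_distance(
    bin_centers, bin_centers, u_weights=counts_a, v_weights=counts_b
)

# Passing the counts as values compares the distribution of the counts
# themselves, which is identical for these two histograms.
d_counts_as_values = wasserstein_distance(counts_a, counts_b)

print(d_weighted, d_counts_as_values)
```

In this toy case the weighted call reports that all mass has to move from 0.5 kW to 2.5 kW, while the second call sees two identical multisets of counts and cannot tell the histograms apart.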

### Building the histograms

To build the histograms, we need to ensure that the bins are the same for all profiles, or at least the same for both profiles in each pair that is compared.
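
As a minimal sketch of the pairwise variant, the bin edges can be derived once from the pooled values of both profiles and then reused for each histogram. The data here is synthetic and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic consumption values (kW) for two hypothetical profiles.
values_a = rng.gamma(shape=2.0, scale=0.5, size=365)
values_b = rng.gamma(shape=2.0, scale=1.5, size=365)

NB_BINS = 10
# Derive one set of edges from the pooled data of both profiles.
pooled = np.concatenate([values_a, values_b])
shared_edges = np.histogram_bin_edges(pooled, bins=NB_BINS)

# Passing explicit edges guarantees both histograms use identical bins.
hist_a, edges_a = np.histogram(values_a, bins=shared_edges)
hist_b, edges_b = np.histogram(values_b, bins=shared_edges)
```

Because the edges span the pooled minimum and maximum, every measurement of both profiles falls inside one of the bins.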

In [None]:
from energyclustering.data.fluvius import read_data_pickle
info_df, data_df = read_data_pickle(include_incomplete_profiles = False, process_errors = True)
data_df = data_df.rename_axis('timestamp', axis = 1)

In [None]:
subset = data_df.sample(5, random_state = 1234)
subset

### Try overall histogram
This does not work well either, because the histogram boundaries are different for every timestamp.  
As a result, a difference of 1 kW at a timestamp with a large range contributes less to the overall distance than a difference of 1 kW at a timestamp with a smaller range.
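
The effect can be illustrated with a small synthetic example. It assumes the distance is measured in bin widths (bin indices as values, counts as weights); `bin_unit_distance` is a hypothetical helper, and the profiles and ranges are made up.

```python
import numpy as np
from scipy.stats import wasserstein_distance

NB_BINS = 10

def bin_unit_distance(values1, values2, value_range):
    """Wasserstein distance between two histograms, measured in bin widths."""
    hist1, _ = np.histogram(values1, bins=NB_BINS, range=value_range)
    hist2, _ = np.histogram(values2, bins=NB_BINS, range=value_range)
    bin_idx = np.arange(NB_BINS)
    return wasserstein_distance(bin_idx, bin_idx, u_weights=hist1, v_weights=hist2)

# Two hypothetical profiles that differ by exactly 1 kW.
profile_a = np.full(100, 0.5)
profile_b = profile_a + 1.0

# Same data, two different histogram ranges.
narrow = bin_unit_distance(profile_a, profile_b, value_range=(0.0, 2.0))  # bin width 0.2 kW
wide = bin_unit_distance(profile_a, profile_b, value_range=(0.0, 20.0))   # bin width 2 kW
print(narrow, wide)
```

With the narrow range the 1 kW difference spans 5 bins, while with the wide range both profiles land in the same bin and the difference disappears.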

In [None]:
import pandas as pd 
import numpy as np
import itertools

In [None]:
def add_date(series):
    # parse times of day into datetimes on a dummy date, so the columns can be resampled later
    return pd.to_datetime(series, format='%H:%M:%S', exact = False)

In [None]:
daily_df = (
        data_df
        .rename_axis('timestamp', axis = 1)
        .stack().to_frame('value')
        .assign(
            time=lambda x: add_date(x.index.get_level_values('timestamp').time),
            date=lambda x: x.index.get_level_values('timestamp').date.astype('str')
        )
        .pipe(lambda x: pd.pivot_table(x, index=['meterID','year','date'], columns=['time'], values='value', dropna= False))
        # aggregate to 4-hour consumption blocks
        .resample('4H', axis = 1).sum()
    )

daily_df

## Visualise

In [None]:
import seaborn as sns
sns.set_theme(style="whitegrid")
import matplotlib.pyplot as plt
import altair as alt
alt.data_transformers.disable_max_rows()
from energyclustering.visualization.cluster_visualization import all_day_plot

In [None]:
PROFILE_TO_PLOT = data_df.index[3]
PROFILE2_TO_PLOT = data_df.index[0]
plot_df1 = daily_df.loc[PROFILE_TO_PLOT].stack().to_frame('value').reset_index().assign(time = lambda x: x.time.dt.strftime('%H:%M'))
plot_df2 = daily_df.loc[PROFILE2_TO_PLOT].stack().to_frame('value').reset_index().assign(time = lambda x: x.time.dt.strftime('%H:%M'))

fig, axes = plt.subplots(1,2, figsize = (14,6), sharey = True)
ax = sns.violinplot(ax = axes[0], x="time", y="value", data=plot_df1)
ax = sns.violinplot(ax = axes[1], x = 'time', y='value', data=plot_df2)



In [None]:
all_day_plot(PROFILE_TO_PLOT, data_df.resample('4H', axis = 1).sum()).properties(width = 1000)

In [None]:
daily_df.index.droplevel('date').unique()

In [None]:
min_values, max_values = daily_df.min(axis = 0), daily_df.max(axis = 0)
max_values

In [None]:
NB_BINS = 10
histogram_dict = dict()
for profile, profile_df in daily_df.groupby('meterID'):
    # one histogram per time-of-day column (6 columns after the 4H resample, not 24)
    histograms = np.zeros((len(profile_df.columns), NB_BINS))
    for idx, column in enumerate(profile_df.columns):
        values = profile_df[column]
        # the bin range is shared by all profiles but differs per column
        hist, bin_edges = np.histogram(values, bins = NB_BINS, range=(min_values[column], max_values[column]))
        histograms[idx, :] = hist
    histogram_dict[profile] = histograms

In [None]:
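distance_entries = []
for profile1, profile2 in itertools.combinations(histogram_dict.keys(), 2):
    distances = []
    for hist1, hist2 in zip(histogram_dict[profile1], histogram_dict[profile2]):
        # skip empty histograms, for which the distance is undefined
        if hist1.sum() == 0 or hist2.sum() == 0:
            continue
        # bin indices are the values and counts are the weights, so the distance is in bin widths;
        # passing the counts as values would compare the distribution of the counts instead
        bin_idx = np.arange(len(hist1))
        distance = wasserstein_distance(bin_idx, bin_idx, u_weights=hist1, v_weights=hist2)
        distances.append(distance)
    distance_entries.append((profile1, profile2, np.sum(distances)))
distance_df = pd.DataFrame(distance_entries, columns = ['p1', 'p2', 'distance'])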

In [None]:
distance_df.sort_values('distance')

## Try pairwise histogram
This does not work either, because the scale should be the same for all comparisons. With pairwise bin edges, a distance of 10 bins can correspond to 1 kW in one comparison and to 10 kW in another, so the distances of different pairs are not comparable.
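
One possible way around this, sketched here as an assumption rather than as the method used for the results further down, is to express the distance in kW by using the bin centers as values. `histogram_distance_kw` is a hypothetical name and the data is synthetic.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def histogram_distance_kw(values1, values2, nb_bins=10):
    """Pairwise-binned Wasserstein distance in the unit of the data (kW)."""
    values1 = np.asarray(values1, dtype=float)
    values2 = np.asarray(values2, dtype=float)
    min_value = min(values1.min(), values2.min())
    max_value = max(values1.max(), values2.max())
    hist1, edges = np.histogram(values1, bins=nb_bins, range=(min_value, max_value))
    hist2, _ = np.histogram(values2, bins=nb_bins, range=(min_value, max_value))
    # Bin centers carry the kW scale; counts act as weights.
    centers = (edges[:-1] + edges[1:]) / 2
    return wasserstein_distance(centers, centers, u_weights=hist1, v_weights=hist2)

rng = np.random.default_rng(1)
values_a = rng.uniform(0.0, 2.0, size=500)
values_b = values_a + 1.0  # same shape, shifted by 1 kW

binned = histogram_distance_kw(values_a, values_b)
raw = wasserstein_distance(values_a, values_b)  # distance on the raw samples
print(binned, raw)
```

Moving each measurement to its bin center shifts it by at most half a bin width, so the binned distance stays within one bin width of the distance on the raw samples, whatever range the pair happens to have.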

In [None]:
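def histogram_distance(values1, values2):
    NB_BINS = 10
    # use the same bin edges for both profiles in this pair
    min_value = min(values1.min(), values2.min())
    max_value = max(values1.max(), values2.max())
    hist1, _ = np.histogram(values1, NB_BINS, range = (min_value, max_value))
    hist2, _ = np.histogram(values2, NB_BINS, range = (min_value, max_value))
    # bin indices are the values and counts are the weights, so the result is in bin widths
    bin_idx = np.arange(NB_BINS)
    return wasserstein_distance(bin_idx, bin_idx, u_weights=hist1, v_weights=hist2)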

In [None]:
distance_entries = []
profiles = daily_df.index.get_level_values(0).unique()
for p1, p2 in itertools.combinations(profiles, 2): 
    p1_df = daily_df.loc[p1]
    p2_df = daily_df.loc[p2] 
    distances = []
    for column in p1_df: 
        distance = histogram_distance(p1_df[column], p2_df[column]) 
        distances.append(distance)
    distance_entries.append((p1, p2, np.sum(distances)))
distance_df = pd.DataFrame(distance_entries, columns = ['p1', 'p2', 'distance'])

In [None]:
distance_df.sort_values('distance')

# Look at results

In [None]:
from energyclustering.webapp.resultparser import ResultParser, ResultComparison

In [None]:

HIST = 'histogram_bins_20'
hist_result = ResultParser('result_20210628_koen', HIST)

In [None]:
hist_result.distance_matrix