### Merging of several StationData objects into one 

This notebook illustrates how to merge several `StationData` objects into one. Merging only works for data from the same station. A typical use case is a station whose source data is split across many files, each containing only part of the record (e.g. one year of data per file).

In the following, the EBAS database is used for illustration. In particular, we focus on retrieving the aerosol light scattering coefficients at 550 nm (**scatc550aer** in AEROCOM naming convention) for the station ***Jungfraujoch***, located in Switzerland.

In [1]:
import pyaerocom as pya

Initating pyaerocom configuration
Checking database access...
Checking access to: /lustre/storeA
Access to lustre database: True
Init data paths for lustre
Expired time: 0.084 s


#### Get list of all files containing scattering data for EBAS station Jungfraujoch

In [None]:
reader = pya.io.ReadEbas()
data = reader.read(vars_to_retrieve='scatc550aer',
                   datalevel=2, station_names='Jungfraujoch')
print(data)

As you can see, the data has successfully been imported into an instance of the ``UngriddedData`` class. This class is organised *by file*, that is, each of the 26 imported files is assigned its own metadata dictionary. Let's look at the metadata from the first file:

In [None]:
data.metadata[0]

And the last one:

In [None]:
data.metadata[25]

As you can see, both files contain scattering data but do not share all the same metadata attributes (e.g. ``instrument_name`` is different, which might be due to technological updates over time). 

Let's have a look at the respective time series for both files. First, convert each into an instance of the `StationData` class and then plot.

In [None]:
first_file = data.to_station_data(0, vars_to_convert='scatc550aer')
print(first_file)

In [None]:
last_file = data.to_station_data(25)
print(last_file)

In [None]:
first_file.plot_timeseries('scatc550aer');

In [None]:
last_file.plot_timeseries('scatc550aer');

As you can see, the files contain data from different years. Now, how can we get these objects into one object that contains the timeseries of both files from this station?

This is actually very easy:

In [None]:
merged = first_file.merge_other(last_file, 'scatc550aer')
print(merged)

As you can see in the output, the merge not only combines the data arrays but also registers any differences in the associated metadata (e.g. sampling wavelength 550 nm vs. 525 nm, instrument name, PI).
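The idea of registering metadata differences during a merge can be sketched in plain Python. This is only an illustration, not pyaerocom's implementation: the helper `merge_metadata`, the dictionary keys and the instrument names are all hypothetical.

```python
def merge_metadata(first, second):
    """Merge two metadata dicts; values that differ are collected in a list.

    Hypothetical helper for illustration only (not part of pyaerocom).
    """
    merged = dict(first)
    for key, value in second.items():
        if key not in merged:
            merged[key] = value
        elif merged[key] != value:
            existing = merged[key]
            if not isinstance(existing, list):
                existing = [existing]
            if value not in existing:
                existing = existing + [value]
            merged[key] = existing
    return merged


# Made-up metadata, loosely modelled on the differences mentioned above
meta_first = {"station_name": "Jungfraujoch",
              "wavelength_nm": 550,
              "instrument_name": "instrument_A"}
meta_last = {"station_name": "Jungfraujoch",
             "wavelength_nm": 525,
             "instrument_name": "instrument_B"}

merged_meta = merge_metadata(meta_first, meta_last)
print(merged_meta)
```

Attributes that agree (here the station name) stay scalar, while attributes that differ keep every distinct value, so no information from either file is lost.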

Now, have a look at the merged timeseries data. 

In [None]:
merged.plot_timeseries('scatc550aer');

Looks okay. Let's merge all 26 files and see if we get a nice long time series.

Retrieve list of `StationData` objects:

In [None]:
stats = data.to_station_data('Jungfraujoch', 'scatc550aer', merge_if_multi=False)
print('Number of StationData objects retrieved: {}'.format(len(stats)))

Now merge them into one long time series:

In [None]:
merged = pya.helpers.merge_station_data(stats, var_name='scatc550aer')
print(merged)

And plot...

In [None]:
ax = merged.plot_timeseries('scatc550aer')
merged.plot_timeseries('scatc550aer', freq='daily', ax=ax)
merged.plot_timeseries('scatc550aer', freq='monthly', lw=3, ax=ax)
merged.plot_timeseries('scatc550aer', freq='yearly', ls='none', marker='o', ms=10, ax=ax);

#### A note on convenience

Actually, in the default setup you do not really need to think about all this. As you might have noticed, when creating the list of `StationData` objects from the `UngriddedData` object (using the method `to_station_data`), we passed the argument `merge_if_multi=False`.

The default here is `True`, so you can just go ahead and do:

In [None]:
data.to_station_data('Jungfraujoch', 'scatc550aer').plot_timeseries('scatc550aer')

Internally, `to_station_data` creates a list of `StationData` objects and, if `merge_if_multi=True`, merges them at the end using the `merge_station_data` method illustrated above.

#### What about overlapping data?

In some situations, the individual time series overlap in time, which causes conflicts when merging them into one long time series. In the following, we illustrate how these overlaps are handled if they occur.

The `merge_station_data` method illustrated above has options for handling overlapping data. In any case, all detected overlaps are stored in the `overlap` attribute of the merged `StationData` object. Let's first check whether there are any overlaps in the Jungfraujoch data:

In [None]:
merged.overlap

Apparently, there are. You can inspect these data (in comparison with the retrieved time series) as follows:

In [None]:
merged.plot_timeseries('scatc550aer', add_overlaps=True);
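To make the notion of an overlap concrete, here is a minimal sketch in plain Python. It assumes that values from the preferred series win wherever two series share a timestamp and that the displaced values are set aside as the overlap. The helper `merge_with_overlap` and the sample values are hypothetical and do not reproduce pyaerocom's internals.

```python
def merge_with_overlap(preferred, other):
    """Merge two {timestamp: value} series.

    Values from `preferred` are kept wherever both series have a value for
    the same timestamp; the displaced values from `other` are returned
    separately as the overlap. Hypothetical helper for illustration only.
    """
    merged = dict(preferred)
    overlap = {}
    for timestamp, value in other.items():
        if timestamp in merged:
            overlap[timestamp] = value
        else:
            merged[timestamp] = value
    # ISO date strings sort chronologically
    return dict(sorted(merged.items())), overlap


# Made-up daily values from two files that share one day
series_a = {"2010-12-30": 1.0, "2010-12-31": 1.2, "2011-01-01": 1.1}
series_b = {"2011-01-01": 0.9, "2011-01-02": 1.3}

merged_series, overlap = merge_with_overlap(series_a, series_b)
print(merged_series)
print(overlap)
```

Which series counts as "preferred" is exactly the decision that the ordering options discussed in the next section control.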

#### How to prioritise certain stations over others when deciding what goes into the overlap and what into the final time series?

The method `merge_station_data` provides several options for this. Rather than repeating them here, please read the docstring of the method:

In [None]:
help(pya.helpers.merge_station_data)

In particular, `pref_attr` and `sort_by_largest` are relevant here. 

**NOTE**: if `pref_attr` is unspecified, then the stations are sorted based on the number of valid measurement points for the input variable. This was done in the merged time series that we retrieved above.
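The two ordering strategies described in the note can be sketched as follows. This is a simplified illustration under stated assumptions: stations are plain dictionaries, `order_by_relevance` is a hypothetical helper, and the parameter names merely mirror `pref_attr` and `sort_by_largest`. The real behaviour is documented in the docstring of `merge_station_data`.

```python
import math


def order_by_relevance(stations, var_name, pref_attr=None, sort_by_largest=True):
    """Order stations from most to least relevant.

    Without `pref_attr`, stations are ranked by their number of valid
    (non-NaN) values for `var_name`; otherwise by the value of `pref_attr`.
    Hypothetical helper for illustration only.
    """
    if pref_attr is None:
        def key(station):
            return sum(1 for v in station[var_name] if not math.isnan(v))
    else:
        missing = [s for s in stations if pref_attr not in s]
        if missing:
            raise KeyError(
                f"attribute {pref_attr!r} missing in {len(missing)} station(s)"
            )

        def key(station):
            return station[pref_attr]
    return sorted(stations, key=key, reverse=sort_by_largest)


# Made-up stations: file_a has more valid points, file_b a newer revision
stations = [
    {"name": "file_a", "revision_date": "2015-03-01",
     "scatc550aer": [1.0, float("nan"), 1.2, 1.1]},
    {"name": "file_b", "revision_date": "2018-06-15",
     "scatc550aer": [0.9, 1.3]},
]

by_count = order_by_relevance(stations, "scatc550aer")
by_revision = order_by_relevance(stations, "scatc550aer", pref_attr="revision_date")
print([s["name"] for s in by_count])
print([s["name"] for s in by_revision])
```

In this sketch the two strategies disagree about which file comes first, which is why the choice of strategy can change what ends up in the overlap.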

Next, instead of sorting the stations by the number of available data points, let's prefer stations with a more recent data revision date. First, let's see what happens when `pref_attr` names an attribute that does not exist:

In [None]:
try:
    merged_pref_awesomeness = pya.helpers.merge_station_data(stats, 'scatc550aer', pref_attr='awesomeness')
except pya.exceptions.MetaDataError as e:
    print('Failed merging, error: {}'.format(repr(e)))

Unfortunately, the `StationData` objects do not contain an attribute `awesomeness` by which we could sort. Let's go with `revision_date` instead:

In [None]:
# recompute stations, since we overwrote one above
stats = data.to_station_data('Jungfraujoch', 'scatc550aer', merge_if_multi=False)
merged_pref_recent_revision = pya.helpers.merge_station_data(stats, 'scatc550aer', pref_attr='revision_date')

In [None]:
merged_pref_recent_revision.plot_timeseries('scatc550aer', add_overlaps=True)

Let's see if there is any difference to the previous method (resample to daily resolution):

In [None]:
ax = merged.overlap['scatc550aer'].resample('D').mean().plot(style='x', label='Overlaps (prefer longer timeseries)',
                                                             figsize=(20, 8))

merged_pref_recent_revision.overlap['scatc550aer'].resample('D').mean().plot(style='o', 
                                                                             label='Overlaps (prefer more recent)',
                                                                             ax=ax, markerfacecolor='none')
ax.legend();

As you can see, the merging strategy can affect the result, so it is important to define a reasonable one. In this case, it is probably more reliable to use the data with the more recent revision date. The scatter plot below compares the two sets of overlap values directly:

In [None]:
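import matplotlib.pyplot as plt

# Compare the overlap values obtained with the two merging strategies.
# Note: scatter requires both series to have the same length.
fig, ax = plt.subplots(figsize=(16, 12))
ax.scatter(merged_pref_recent_revision.overlap['scatc550aer'],
           merged.overlap['scatc550aer'], marker='x')
ax.set_xlabel('Overlaps (prefer more recent)')
ax.set_ylabel('Overlaps (prefer longer timeseries)');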
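A point-by-point comparison like the scatter plot above is only meaningful if each point pairs two values from the same timestamp. The following plain-Python sketch shows one way to restrict two series to their common timestamps first. The helper `paired_values` and the sample values are hypothetical, and whether the real overlap series already share an index is not verified here.

```python
def paired_values(series_x, series_y):
    """Return (x, y) value pairs for timestamps present in both series.

    Hypothetical helper for illustration only.
    """
    common = sorted(set(series_x) & set(series_y))
    return [(series_x[t], series_y[t]) for t in common]


# Made-up overlap values from two different merging strategies
overlap_recent = {"2011-01-01": 0.9, "2011-01-02": 1.3, "2011-01-03": 1.0}
overlap_longest = {"2011-01-02": 1.1, "2011-01-03": 1.0, "2011-01-04": 0.8}

pairs = paired_values(overlap_recent, overlap_longest)
print(pairs)
```

Only the two days present in both series are paired; the remaining days are dropped rather than matched against unrelated values.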