# De-Duplication Example

In this example, we'll show how you can use Kaskada's `Timestream.coalesce()` method to automatically de-duplicate data coming from multiple sources. The example uses sources of the same type, but that is not required.

We start with a script (created with the help of ChatGPT) to generate some test data for us to play with:

In [26]:
!python ./deduplication/event_creation.py
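
The contents of `event_creation.py` are not shown here. As a rough idea of what such a generator might look like, here is a minimal sketch that builds several batches of feeding events in which each batch repeats the last few events of the previous one. The function names, the `overlap` parameter, and the animal names are hypothetical. Only the `timestamp`, `subsort`, and `name` fields and the `feeding_events_{i}.jsonl` file names come from this example.

```python
import json
import random
from pathlib import Path


def generate_feeding_events(num_files=6, events_per_file=5, overlap=2, seed=0):
    """Return one list of event dicts per file, repeating some events across files."""
    rng = random.Random(seed)
    animals = ["otter", "heron", "lynx"]  # made-up key values
    files = []
    previous = []
    next_id = 0
    for _ in range(num_files):
        # Start each file with the tail of the previous one to create duplicates.
        events = list(previous[-overlap:]) if overlap else []
        while len(events) < events_per_file:
            events.append(
                {
                    "timestamp": 1_700_000_000 + next_id * 60,  # seconds since epoch
                    "subsort": next_id,  # unique per distinct event
                    "name": rng.choice(animals),
                }
            )
            next_id += 1
        files.append(events)
        previous = events
    return files


def write_jsonl(files, directory="."):
    """Write each batch to feeding_events_<i>.jsonl, one JSON object per line."""
    for i, events in enumerate(files, start=1):
        path = Path(directory) / f"feeding_events_{i}.jsonl"
        with path.open("w") as handle:
            for event in events:
                handle.write(json.dumps(event) + "\n")
```

Calling `write_jsonl(generate_feeding_events())` would write six small files with overlapping events. The real script produces far more rows than this sketch does.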

Then we initiate a Kaskada session:

In [3]:
import pandas as pd
import kaskada as kd

# Initialize Kaskada with a local execution context.
kd.init_session()

Next, we load the generated data in two ways:
* as Pandas DataFrames
* as Kaskada sources

In [49]:
sources = []
dataFrames = []

# The generated files are numbered 1 through 6.
for i in range(1, 7):
    # Kaskada source: declare the time, key, and subsort columns.
    sources.append(
        await kd.sources.JsonlFile.create(
            f'feeding_events_{i}.jsonl',
            time_column="timestamp",
            key_column="name",
            time_unit="s",
            subsort_column="subsort",
            grouping_name="animals",
        )
    )
    # Pandas DataFrame: read the same JSON Lines file as-is.
    dataFrames.append(
        pd.read_json(
            f'feeding_events_{i}.jsonl',
            orient="records",
            lines=True,
        )
    )

Note: When creating Kaskada sources whose data you expect to merge and de-duplicate, give every source the same `grouping_name` and make sure your dataset defines a `subsort_column`. Kaskada will use the `time_column` in combination with the `subsort_column` to automatically de-duplicate your data.
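
To make the idea concrete, here is a small pandas-only sketch of de-duplicating on the time and subsort columns. It is an illustration of the concept, not Kaskada's internal implementation. It assumes that two rows count as duplicates when their `timestamp`, `subsort`, and `name` values all match, and the data values are made up.

```python
import pandas as pd

# Two toy "sources" sharing one event (same timestamp, subsort, and name).
source_a = pd.DataFrame(
    {
        "timestamp": [1700000000, 1700000060],
        "subsort": [1, 2],
        "name": ["otter", "heron"],
    }
)
source_b = pd.DataFrame(
    {
        "timestamp": [1700000060, 1700000120],
        "subsort": [2, 3],
        "name": ["heron", "otter"],
    }
)

# Plain concatenation keeps the shared event twice.
combined = pd.concat([source_a, source_b], ignore_index=True)

# Assumption: rows are duplicates when time, subsort, and key all match.
deduplicated = combined.drop_duplicates(subset=["timestamp", "subsort", "name"])

print(len(combined))      # 4
print(len(deduplicated))  # 3
```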

To merge multiple sources together, use the `Timestream.coalesce()` method as follows:

In [50]:
merged_timestream = kd.Timestream.coalesce(*sources)

In pandas, we can use `concat()` to combine the DataFrames we loaded (without de-duplication):

In [53]:
merged_df = pd.concat(dataFrames)

Now we can do a small analysis to verify that Kaskada worked as desired:

In [54]:
kaskada_df = merged_timestream.to_pandas()

print(f'Values in Kaskada: {len(kaskada_df["subsort"])}')
print(f'Unique values in Kaskada: {len(kaskada_df["subsort"].unique())}')
print()
print(f'Values in pandas: {len(merged_df["subsort"])}')
print(f'Unique values in pandas: {len(merged_df["subsort"].unique())}')
print()

k_values = kaskada_df["subsort"].tolist()
p_values = merged_df["subsort"].unique().tolist()

k_values.sort()
p_values.sort()

print(f'Values in Kaskada match unique values in Pandas: {k_values == p_values}')

Values in Kaskada: 27375
Unique values in Kaskada: 27375

Values in pandas: 29875
Unique values in pandas: 27375

Values in Kaskada match unique values in Pandas: True
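
From the counts above, the concatenated pandas data holds 29875 - 27375 = 2500 duplicate rows that do not appear in the Kaskada result. If you want to repeat this kind of check on other data, the arithmetic can be wrapped in a small helper. The function name `duplicate_summary` is hypothetical, and the toy DataFrame only stands in for real data.

```python
import pandas as pd


def duplicate_summary(frame: pd.DataFrame, column: str) -> dict:
    """Count total, unique, and duplicate values in one column of a DataFrame."""
    total = len(frame[column])
    unique = frame[column].nunique()  # ignores missing values by default
    return {"total": total, "unique": unique, "duplicates": total - unique}


# Toy data: three distinct subsort values spread over six rows.
toy = pd.DataFrame({"subsort": [1, 2, 2, 3, 3, 3]})
summary = duplicate_summary(toy, "subsort")
print(summary)
```

Applied to `merged_df` and the `"subsort"` column, a helper like this would be expected to report the 2500 duplicates computed above.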
