##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:
```
make venv
source venv/bin/activate && pip install jupyterlab
```

In [None]:
%%capture
## This is here as a reference only
# Users and application developers must install the appropriate release tag from PyPI
#!pip install data-prep-toolkit
#!pip install data-prep-toolkit-transforms

##### ***** Import required classes and modules

In [None]:
from dpk_fdedup.transform_python import Fdedup

##### ***** Setup runtime parameters for this transform
We will only provide a description for the parameters used in this example. For a complete list of parameters, please refer to the README.md for this transform:
|parameter:type | value | description |
|-|-|-|
| input_folder:str | \${PWD}/ray/test-data/input/ | folder that contains the input parquet files for the fuzzy dedup algorithm |
| output_folder:str | \${PWD}/ray/output/ | folder that contains all the intermediate results and the output parquet files for the fuzzy dedup algorithm |
| contents_column:str | contents | name of the column that stores document text |
| document_id_column:str | int_id_column | name of the column that stores document ID |
| num_permutations:int | 112 | number of permutations to use for minhash calculation |
| num_bands:int | 14 | number of bands to use for band hash calculation |
| num_minhashes_per_band:int | 8 | number of minhashes to use in each band |
| operation_mode:{filter_duplicates,filter_non_duplicates,annotate} | filter_duplicates | operation mode for data cleanup: filter out duplicates/non-duplicates, or annotate duplicate documents |
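
To build intuition for how `num_bands` and `num_minhashes_per_band` interact, the sketch below uses the standard MinHash LSH banding formula. It is a minimal illustration and assumes this transform follows the textbook banding scheme; the helper names `candidate_probability` and `approximate_threshold` are hypothetical and are not part of the toolkit. In the example values, 14 bands of 8 minhashes account for all 112 permutations.

```python
def candidate_probability(similarity: float, num_bands: int, rows_per_band: int) -> float:
    """Probability that two documents with the given Jaccard similarity
    agree on every minhash in at least one band (textbook LSH banding)."""
    return 1.0 - (1.0 - similarity ** rows_per_band) ** num_bands


def approximate_threshold(num_bands: int, rows_per_band: int) -> float:
    """Common rule-of-thumb estimate of the similarity at which the
    candidate probability rises most steeply: (1 / b) ** (1 / r)."""
    return (1.0 / num_bands) ** (1.0 / rows_per_band)


# Values taken from the parameter table in this notebook.
num_permutations = 112
num_bands = 14
num_minhashes_per_band = 8

# With these values every minhash belongs to exactly one band.
assert num_bands * num_minhashes_per_band == num_permutations

print(f"approximate threshold: {approximate_threshold(num_bands, num_minhashes_per_band):.3f}")
for s in (0.5, 0.7, 0.8, 0.9):
    p = candidate_probability(s, num_bands, num_minhashes_per_band)
    print(f"similarity={s:.1f} -> candidate probability={p:.3f}")
```

Under this model, more bands make it more likely that a pair becomes a duplicate candidate, and more minhashes per band make it less likely, so the two parameters together shape how aggressive the dedup is.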

In [None]:
# Configure the fuzzy dedup transform and run it end to end
Fdedup(
    input_folder="test-data/input",
    output_folder="output",
    contents_column="contents",
    document_id_column="int_id_column",
    num_permutations=112,
    num_bands=14,
    num_minhashes_per_band=8,
    operation_mode="filter_duplicates",
).transform()


##### **** The output folder will contain the transformed parquet files under `cleaned/`.

In [None]:
import glob
glob.glob("output/cleaned/*")
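
The `glob` call above lists only the first level of `output/cleaned`, while the parquet files read later in this notebook sit one level deeper (for example `data_1/df1.parquet`). The following is a minimal sketch of a recursive listing using only the standard library; the helper name `list_parquet_files` is hypothetical.

```python
from pathlib import Path


def list_parquet_files(root: str) -> list[str]:
    """Return the sorted relative paths of all .parquet files under root.

    Returns an empty list if root does not exist or is not a directory.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    return sorted(
        p.relative_to(root_path).as_posix() for p in root_path.rglob("*.parquet")
    )


# Walk the cleaned output folder produced by the transform, if present.
for name in list_parquet_files("output/cleaned"):
    print(name)
```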

##### ***** Print the input data

In [None]:
import polars as pl
import os
input_df_1 = pl.read_parquet(os.path.join(os.path.abspath(""), "test-data", "input", "data_1", "df1.parquet"))
input_df_2 = pl.read_parquet(os.path.join(os.path.abspath(""), "test-data", "input", "data_2", "df2.parquet"))
input_df = input_df_1.vstack(input_df_2)

with pl.Config(fmt_str_lengths=10000000, tbl_rows=-1):
    print(input_df)

##### ***** Print the output result

In [None]:
import os
import polars as pl
output_df_1 = pl.read_parquet(os.path.join(os.path.abspath(""), "output", "cleaned", "data_1", "df1.parquet"))
output_df_2 = pl.read_parquet(os.path.join(os.path.abspath(""),  "output", "cleaned", "data_2", "df2.parquet"))
output_df = output_df_1.vstack(output_df_2)
with pl.Config(fmt_str_lengths=10000000, tbl_rows=-1):
    print(output_df)
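
Rather than comparing the two printed tables by eye, the document IDs can be compared directly. The sketch below is a hedged illustration on toy ID lists; the helper `summarize_dedup` is hypothetical, and it assumes `filter_duplicates` mode keeps the ID column in the output.

```python
def summarize_dedup(input_ids, output_ids):
    """Compare document IDs before and after dedup.

    Returns counts of distinct IDs plus the IDs that were removed, and any
    IDs that appear in the output but not in the input (expected to be none).
    """
    input_set = set(input_ids)
    output_set = set(output_ids)
    return {
        "input_count": len(input_set),
        "output_count": len(output_set),
        "removed_ids": sorted(input_set - output_set),
        "unexpected_ids": sorted(output_set - input_set),
    }


# Toy data for illustration only; these are not results from the transform.
summary = summarize_dedup([1, 2, 3, 4, 5], [1, 3, 5])
print(summary)
```

In this notebook, the same helper could be called with `input_df["int_id_column"].to_list()` and `output_df["int_id_column"].to_list()`.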