##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:
```
make venv 
source venv/bin/activate 
pip install jupyterlab
venv/bin/jupyter lab
```

In [None]:
%%capture
## This is here for reference only
# Users and application developers must pin the appropriate release tag from PyPI
%pip install 'data-prep-toolkit-transforms[ray,gneissweb_classification]'

##### **** Configure the transform parameters. The dictionary keys that hold the GneissWeb classification transform configuration are as follows:
| Configuration Parameters | Default  | Description |
|------------|----------|--------------|
| gcls_model_credential | _unset_ | specifies the credential used to download the model. This is a Hugging Face token. [Guide to get a Hugging Face token](https://huggingface.co/docs/hub/security-tokens) |
| gcls_model_file_name | _unset_ | specifies the file name of the model to download, for example [`fasttext_medical.bin`] |
| gcls_model_url | _unset_ | specifies the URL where the model is located. For fastText, this is the repository name of the model, for example [`ibm-granite/GneissWeb.Med_classifier`] |
| gcls_n_processes | 1 | number of processes. Must be a positive integer |
| gcls_content_column_name | `contents` | specifies the name of the column containing the documents |
| gcls_output_label_column_name | [`label`] | specifies the names of the output columns that hold the predicted classes |
| gcls_output_score_column_name | [`score`] | specifies the names of the output columns that hold the prediction scores |
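
As a minimal sketch of how these keys fit together, the snippet below assembles them into a dictionary and reads the Hugging Face token from an environment variable instead of hard-coding it in the notebook. The helper name `build_gcls_params` and the environment variable name `HF_TOKEN` are assumptions made for illustration, not part of the transform; the model file name and repository are the example values from the table.

```python
import os


def build_gcls_params(token_env_var="HF_TOKEN", n_processes=1):
    """Hypothetical helper: assemble the gcls_* configuration keys into a dict."""
    # The table requires gcls_n_processes to be a positive integer.
    if isinstance(n_processes, bool) or not isinstance(n_processes, int) or n_processes < 1:
        raise ValueError("gcls_n_processes must be a positive integer")

    # Assumption: the token was exported in the shell before starting Jupyter.
    # An empty string is used when the variable is not set.
    token = os.environ.get(token_env_var, "")

    return {
        "gcls_model_credential": token,
        "gcls_model_file_name": ["fasttext_medical.bin"],
        "gcls_model_url": ["ibm-granite/GneissWeb.Med_classifier"],
        "gcls_n_processes": n_processes,
        "gcls_content_column_name": "contents",
        "gcls_output_label_column_name": ["label"],
        "gcls_output_score_column_name": ["score"],
    }


params = build_gcls_params(n_processes=2)
print(sorted(params))
```

A dictionary built this way could be unpacked with `**params` into the transform call, alongside `input_folder` and `output_folder`, which keeps the credential out of the saved notebook.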

##### **** Import required classes and modules

In [1]:
from dpk_gneissweb_classification.ray.transform import Classification

##### **** Set up the runtime parameters for this transform and run it

In [2]:
%%time
Classification(
    input_folder="test-data/input",
    output_folder="output",
    # Replace the placeholder with your own Hugging Face token
    gcls_model_credential="PUT YOUR OWN HUGGINGFACE CREDENTIAL",
    gcls_model_file_name=["fasttext_medical.bin"],
    gcls_model_url=["ibm-granite/GneissWeb.Med_classifier"],
    gcls_n_processes=2,
    gcls_output_label_column_name=["label"],
    run_locally=True,
    # The test data stores documents in "text" rather than the default "contents"
    gcls_content_column_name="text",
).transform()

10:36:20 INFO - parameters are : {'gcls_model_credential': 'PUT YOUR OWN HUGGINGFACE CREDENTIAL', 'gcls_model_file_name': ["['fasttext_medical.bin']"], 'gcls_model_url': ["['ibm-granite/GneissWeb.Med_classifier']"], 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': ["['label']"], 'gcls_output_score_column_name': ["['score']"], 'gcls_n_processes': 2}
10:36:20 INFO - pipeline id pipeline_id
10:36:20 INFO - code location None
10:36:20 INFO - number of workers 1 worker options {'num_cpus': 0.8, 'max_restarts': -1}
10:36:20 INFO - actor creation delay 0
10:36:20 INFO - job details {'job category': 'preprocessing', 'job name': 'gcls', 'job type': 'ray', 'job id': 'job_id'}
10:36:20 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output
10:36:20 INFO - data factory data_ max_files -1, n_sample -1
10:36:20 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.p

CPU times: user 134 ms, sys: 115 ms, total: 249 ms
Wall time: 20 s


0

##### **** The specified output folder now contains the transformed parquet files and a metadata file.

In [3]:
import glob
glob.glob("output/*")

['output/metadata.json', 'output/test_01.parquet']
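
To go one step beyond listing the folder, the following sketch separates the parquet files from `metadata.json` and loads the metadata, using only the standard library. The helper name `summarize_output` is hypothetical, and no assumption is made about which keys the metadata file contains; the snippet only prints whatever top-level keys it finds.

```python
import glob
import json
import os


def summarize_output(folder="output"):
    """Hypothetical helper: list parquet files and load metadata.json if present."""
    parquet_files = sorted(glob.glob(os.path.join(folder, "*.parquet")))

    metadata = None
    metadata_path = os.path.join(folder, "metadata.json")
    if os.path.exists(metadata_path):
        with open(metadata_path, encoding="utf-8") as f:
            metadata = json.load(f)

    return parquet_files, metadata


files, metadata = summarize_output("output")
print(f"{len(files)} parquet file(s): {files}")
if isinstance(metadata, dict):
    # Print only the top-level keys; the structure is not assumed here.
    print("metadata keys:", sorted(metadata))
```

To look at the predicted labels and scores themselves, the parquet file can be loaded with a parquet reader such as `pandas.read_parquet`, assuming pandas and pyarrow are installed in the environment.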