##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:
```
make venv 
source venv/bin/activate 
pip install jupyterlab
venv/bin/jupyter lab
```

In [1]:
%%capture
## This is here as a reference only
# Users and application developers must use the right tag for the latest release from PyPI
%pip install 'data-prep-toolkit-transforms[gneissweb_classification]'
%pip install pandas

##### **** Configure the transform parameters. The dictionary keys that hold the configuration for the GneissWeb classification transform are as follows:
| Configuration Parameters  | Default  | Description |
|------------|----------|--------------|
| gcls_model_credential | _unset_ | Credential used to download the model. This is a Hugging Face token. [Guide to getting a Hugging Face token](https://huggingface.co/docs/hub/security-tokens) |
| gcls_model_file_name | _unset_ | File name(s) of the model(s) to download, for example [`fasttext_science.bin`] |
| gcls_model_url | _unset_ | Location of the model(s). For fastText, this is the repository name of the model(s), for example [`ibm-granite/GneissWeb.Sci_classifier`] |
| gcls_content_column_name | `contents` | Name of the column containing the documents |
| gcls_output_label_column_name | [`label`] | Name(s) of the output column(s) that hold the predicted classes |
| gcls_output_score_column_name | [`score`] | Name(s) of the output column(s) that hold the prediction scores |
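
In the example call later in this notebook, `gcls_model_file_name`, `gcls_model_url`, `gcls_output_label_column_name`, and `gcls_output_score_column_name` are each passed as a list with one entry per model. Assuming these lists are meant to line up position by position (the table does not state this), a small check can catch a mismatch before any model is downloaded. `check_parallel_params` and `LIST_PARAMS` are hypothetical names used for illustration; they are not part of the toolkit.

```python
# Hypothetical helper, not part of data-prep-toolkit: checks that the
# list-valued gcls_* parameters all have the same number of entries.
LIST_PARAMS = (
    "gcls_model_file_name",
    "gcls_model_url",
    "gcls_output_label_column_name",
    "gcls_output_score_column_name",
)


def check_parallel_params(params):
    """Return the shared list length, or raise ValueError on a mismatch."""
    lengths = {}
    for key in LIST_PARAMS:
        value = params[key]
        # Treat a bare string as a single-entry list.
        lengths[key] = 1 if isinstance(value, str) else len(value)
    if len(set(lengths.values())) != 1:
        raise ValueError(f"list parameters differ in length: {lengths}")
    return next(iter(lengths.values()))


params = {
    "gcls_model_file_name": ["fasttext_gneissweb_quality_annotator.bin", "fasttext_medical.bin"],
    "gcls_model_url": ["ibm-granite/GneissWeb.Quality_annotator", "ibm-granite/GneissWeb.Med_classifier"],
    "gcls_output_label_column_name": ["label_quality", "label_med"],
    "gcls_output_score_column_name": ["score_quality", "score_med"],
}
print(check_parallel_params(params))  # 2
```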

##### ***** Import required classes and modules

In [2]:
from dpk_gneissweb_classification.transform_python import Classification

##### ***** Setup runtime parameters for this transform

In [3]:
%%time
Classification(
    input_folder="test-data/input",
    output_folder="output",
    gcls_model_credential="PUT YOUR OWN HUGGINGFACE CREDENTIAL",
    gcls_model_file_name=["fasttext_gneissweb_quality_annotator.bin", "fasttext_medical.bin"],
    gcls_model_url=["ibm-granite/GneissWeb.Quality_annotator", "ibm-granite/GneissWeb.Med_classifier"],
    gcls_output_label_column_name=["label_quality", "label_med"],
    gcls_output_score_column_name=["score_quality", "score_med"],
    gcls_content_column_name="text",
).transform()

12:32:11 INFO - parameters are : {'gcls_model_credential': 'PUT YOUR OWN HUGGINGFACE CREDENTIAL', 'gcls_model_file_name': ["['fasttext_gneissweb_quality_annotator.bin', 'fasttext_medical.bin']"], 'gcls_model_url': ["['ibm-granite/GneissWeb.Quality_annotator', 'ibm-granite/GneissWeb.Med_classifier']"], 'gcls_content_column_name': 'text', 'gcls_output_label_column_name': ["['label_quality', 'label_med']"], 'gcls_output_score_column_name': ["['score_quality', 'score_med']"], 'gcls_n_processes': 1}
12:32:11 INFO - pipeline id pipeline_id
12:32:11 INFO - code location None
12:32:11 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output
12:32:11 INFO - data factory data_ max_files -1, n_sample -1
12:32:11 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
12:32:11 INFO - orchestrator gcls started at 2025-02-17 12:32:11
12:32:11 INF

CPU times: user 3 s, sys: 2.86 s, total: 5.85 s
Wall time: 13.6 s


0
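
The call above embeds the credential as a string literal. A minimal sketch of an alternative, assuming the token has been exported in an environment variable before JupyterLab was started. The variable name `HF_TOKEN` and the helper `get_hf_token` are illustrative choices, not something this transform requires.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face token from the environment; fail loudly if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"set the {env_var} environment variable to your Hugging Face token")
    return token
```

The result can then be passed as `gcls_model_credential=get_hf_token()`, which keeps the token out of the saved notebook.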

##### **** The specified folder will include the transformed parquet files.

In [4]:
import glob
glob.glob("output/*")

['output/metadata.json', 'output/test_01.parquet']
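
Alongside the parquet file, the listing shows a `metadata.json`. This notebook does not describe its contents, so the sketch below only loads it with the standard library and prints the top-level keys, assuming the file holds a JSON object. `load_metadata` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_metadata(path="output/metadata.json"):
    """Load the run metadata, or return None if the file does not exist."""
    p = Path(path)
    if not p.exists():
        return None
    with p.open(encoding="utf-8") as f:
        return json.load(f)


metadata = load_metadata()
if isinstance(metadata, dict):
    # Print only the top-level keys; the schema is not assumed here.
    print(sorted(metadata))
```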

In [5]:
import pandas as pd
pd.read_parquet('output/test_01.parquet', engine='pyarrow')

Unnamed: 0,text,id,dump,url,date,file_path,language,language_score,token_count,watsonnlp_top_category0,...,avg_grade_level,mcalpine_eflaw_textstat,dclm_fasttext_label,dclm_fasttext_score,cosmo_10k_edu_fasttext_label,cosmo_10k_edu_fasttext_score,label_quality,score_quality,label_med,score_med
0,A staffer sells cars via livestream at a deale...,<urn:uuid:567e2e87-397a-4119-93e9-d72d59b61f90>,CC-MAIN-2023-14,https://peoplesdaily.pdnews.cn/business/vehicl...,2023-03-27T23:11:21Z,s3://commoncrawl/crawl-data/CC-MAIN-2023-14/se...,en,0.967074,1239,automotive,...,9.436667,22.1,cc,0.002249,cc,0.012263,cc,0.987,cc,0.994
1,The May 1st submission deadline may feel like ...,<urn:uuid:3330ddd2-9c19-4da4-8feb-c41d0ba3b65f>,CC-MAIN-2023-14,https://performancein.com/news/2019/01/29/all-...,2023-03-27T23:08:19Z,s3://commoncrawl/crawl-data/CC-MAIN-2023-14/se...,en,0.944369,418,news and politics,...,11.67,30.3,cc,5e-05,cc,6.7e-05,cc,0.999,cc,0.997
2,Yes! Cinnamon Oil is a great way to deter mice...,<urn:uuid:e8d2ac4f-cde2-4c45-afd9-3a19cfb86d4c>,CC-MAIN-2023-14,https://peskylittlecritters.com/does-cinnamon-...,2023-03-27T23:15:19Z,s3://commoncrawl/crawl-data/CC-MAIN-2023-14/se...,en,0.906198,490,food & drink,...,8.98,26.2,cc,0.009224,cc,0.021643,cc,0.978,cc,0.844
3,Rosemary Oil can be used to deter cockroaches....,<urn:uuid:bd5c2a03-9a9b-43e2-872a-f7123213bea9>,CC-MAIN-2023-14,https://peskylittlecritters.com/does-rosemary-...,2023-03-27T23:18:25Z,s3://commoncrawl/crawl-data/CC-MAIN-2023-14/se...,en,0.916242,513,science,...,9.37,23.8,cc,0.007073,cc,0.005885,cc,0.994,cc,0.876
4,A cat might have discovered an insect crawling...,<urn:uuid:1922dc93-9fb8-4775-b147-88c589a7bd65>,CC-MAIN-2023-14,https://petcatty.com/why-does-my-cat-stare-at-...,2023-03-27T23:28:27Z,s3://commoncrawl/crawl-data/CC-MAIN-2023-14/se...,en,0.967236,1172,pets,...,6.396667,20.6,hq,0.960727,hq,0.881134,hq,0.881,cc,0.974
5,A staffer sells cars via livestream at a deale...,<urn:uuid:567e2e87-397a-4119-93e9-d72d59b61f90>,CC-MAIN-2023-14,https://peoplesdaily.pdnews.cn/business/vehicl...,2023-03-27T23:11:21Z,s3://commoncrawl/crawl-data/CC-MAIN-2023-14/se...,en,0.967074,1239,automotive,...,9.436667,22.1,cc,0.002249,cc,0.012263,cc,0.987,cc,0.994
6,The May 1st submission deadline may feel like ...,<urn:uuid:3330ddd2-9c19-4da4-8feb-c41d0ba3b65f>,CC-MAIN-2023-14,https://performancein.com/news/2019/01/29/all-...,2023-03-27T23:08:19Z,s3://commoncrawl/crawl-data/CC-MAIN-2023-14/se...,en,0.944369,418,news and politics,...,11.67,30.3,cc,5e-05,cc,6.7e-05,cc,0.999,cc,0.997
7,Yes! Cinnamon Oil is a great way to deter mice...,<urn:uuid:e8d2ac4f-cde2-4c45-afd9-3a19cfb86d4c>,CC-MAIN-2023-14,https://peskylittlecritters.com/does-cinnamon-...,2023-03-27T23:15:19Z,s3://commoncrawl/crawl-data/CC-MAIN-2023-14/se...,en,0.906198,490,food & drink,...,8.98,26.2,cc,0.009224,cc,0.021643,cc,0.978,cc,0.844
8,Rosemary Oil can be used to deter cockroaches....,<urn:uuid:bd5c2a03-9a9b-43e2-872a-f7123213bea9>,CC-MAIN-2023-14,https://peskylittlecritters.com/does-rosemary-...,2023-03-27T23:18:25Z,s3://commoncrawl/crawl-data/CC-MAIN-2023-14/se...,en,0.916242,513,science,...,9.37,23.8,cc,0.007073,cc,0.005885,cc,0.994,cc,0.876
9,A cat might have discovered an insect crawling...,<urn:uuid:1922dc93-9fb8-4775-b147-88c589a7bd65>,CC-MAIN-2023-14,https://petcatty.com/why-does-my-cat-stare-at-...,2023-03-27T23:28:27Z,s3://commoncrawl/crawl-data/CC-MAIN-2023-14/se...,en,0.967236,1172,pets,...,6.396667,20.6,hq,0.960727,hq,0.881134,hq,0.881,cc,0.974
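
The four new columns (`label_quality`, `score_quality`, `label_med`, `score_med`) can be used to select documents downstream. The sketch below rebuilds three rows from the output above by hand (only the new columns) and filters on a label and a minimum score. The helper name `select_by_label` and the thresholds are arbitrary illustrations, not recommended values.

```python
import pandas as pd

# Three rows copied from the output above (rows 0, 2, and 4), new columns only.
df = pd.DataFrame(
    {
        "label_quality": ["cc", "cc", "hq"],
        "score_quality": [0.987, 0.978, 0.881],
        "label_med": ["cc", "cc", "cc"],
        "score_med": [0.994, 0.844, 0.974],
    }
)


def select_by_label(frame, label_col, score_col, label, min_score):
    """Return rows whose predicted label matches and whose score is at least min_score."""
    mask = (frame[label_col] == label) & (frame[score_col] >= min_score)
    return frame[mask]


hq_rows = select_by_label(df, "label_quality", "score_quality", "hq", 0.8)
print(len(hq_rows))  # 1
```

On the real output, the same call would take the frame returned by `pd.read_parquet('output/test_01.parquet', engine='pyarrow')` in place of `df`.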
