***Header_cleanser Transform Sample Notebook***

These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:

make venv \
source venv/bin/activate \
pip install jupyterlab \
./python/venv/bin/jupyter lab

In [1]:
%%capture
# Users and application developers must use the tag of the latest released version on PyPI
!pip install scancode-toolkit
!pip install data-prep-toolkit==0.2.2.dev2
!pip install 'data-prep-toolkit-transforms[header_cleanser]==0.2.2.dev2'
!pip install pandas

### Configure the transform parameters
* Define the transform parameters required for processing. Below are the parameters specific to the Header Cleanser Transform:

    * header_cleanser_contents_column_name: Column containing code to cleanse (default: contents).
    * header_cleanser_copyright: Whether to remove copyright headers (default: True).
    * header_cleanser_license: Whether to remove license headers (default: True).
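The three parameters above can be collected into a dictionary with a small helper. This is a minimal sketch: `header_cleanser_params` is a hypothetical name, not part of the toolkit, and its defaults simply mirror the ones listed above.

```python
def header_cleanser_params(
    contents_column: str = "contents",
    remove_copyright: bool = True,
    remove_license: bool = True,
) -> dict:
    """Build the Header Cleanser parameters (hypothetical helper, not a toolkit API)."""
    if not contents_column:
        raise ValueError("contents_column must be a non-empty column name")
    return {
        "header_cleanser_contents_column_name": contents_column,
        "header_cleanser_copyright": bool(remove_copyright),
        "header_cleanser_license": bool(remove_license),
    }


# Example: strip copyright headers but keep license headers
license_kept = header_cleanser_params(remove_license=False)
print(license_kept)
```

The resulting dictionary can be merged with the `data_local_config` entry before the parameters are handed to the launcher.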

In [2]:
from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from header_cleanser_transform_python import HeaderCleanserPythonTransformConfiguration

***Specify input/output folders and parameters***

In [3]:
# Input/output configuration
local_conf = {
    "input_folder": "path/to/your/input/folder",  # Adjust path for input files
    "output_folder": "path/to/your/output/folder",  # Adjust path for output files
}

# Parameters for the transform
params = {
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "header_cleanser_contents_column_name": "contents",
    "header_cleanser_copyright": True,
    "header_cleanser_license": True
}

***Invoke the header_cleanser transformation***
* Launch the transform using the PythonTransformLauncher.

In [4]:
import sys

# Simulate command-line arguments from the params dictionary
sys.argv = ParamsUtils.dict_to_req(d=params)
# Create the launcher
launcher = PythonTransformLauncher(HeaderCleanserPythonTransformConfiguration())
# Launch the transform
return_code = launcher.launch()

11:05:41 INFO - pipeline id pipeline_id
11:05:41 INFO - code location None
11:05:41 INFO - data factory data_ is using local data access: input_folder - path/to/your/input/folder output_folder - path/to/your/output/folder
11:05:41 INFO - data factory data_ max_files -1, n_sample -1
11:05:41 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
11:05:41 INFO - orchestrator header_cleanser started at 2025-01-09 11:05:41
11:05:41 ERROR - No input files to process - exiting
11:05:41 INFO - Completed execution in 0.0 min, execution result 0
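The log above ends with `No input files to process` yet still reports `execution result 0`, so the return code alone does not reveal an empty or mistyped input folder. A pre-flight check along these lines can catch that before launching. It is a sketch using only the standard library: `find_parquet_files` is a hypothetical helper, the `.parquet` extension is taken from the `files to use` entry in the log, and looking only at the top level of the folder is an assumption made for simplicity.

```python
from pathlib import Path


def find_parquet_files(input_folder: str) -> list[Path]:
    """Return the .parquet files directly inside input_folder, sorted by name.

    Hypothetical helper for illustration; not part of data-prep-toolkit.
    """
    folder = Path(input_folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"Input folder does not exist: {folder}")
    return sorted(p for p in folder.iterdir() if p.is_file() and p.suffix == ".parquet")


if __name__ == "__main__":
    try:
        files = find_parquet_files("path/to/your/input/folder")
        print(f"{len(files)} parquet file(s) found")
    except FileNotFoundError as err:
        print(err)
```

Running this with the same path used for `input_folder` shows whether the launcher will have anything to process.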


***Checking the output Parquet file***

In [None]:
import pyarrow.parquet as pq

# Read one output file; replace the path with a file written to your output folder
table = pq.read_table('path/to/your/output/folder/sample.parquet')
table.to_pandas()

Unnamed: 0,contents
0,"<?xml version=""1.0"" encoding=""UTF-8""?>\n<!--\n..."
1,"/*\n * Copyright 2018 Makoto Consulting Group,..."
2,"<?xml version=""1.0"" encoding=""UTF-8""?>\n\n<!--..."
3,"/*\n Copyright 2018 Makoto Consulting Group,..."
4,# Copyright 2016 The TensorFlow Authors. All R...
5,"<?xml version=""1.0"" encoding=""UTF-8""?>\n\n<!--..."
6,"/*\n * Licensed under the Apache License, Vers..."
7,#! \n#\n# Script to run the DataCreator progra...
8,#!/bin/bash\n\n###############################...
9,# Copyright IBM Corp. and others 2018\n#\n# Th...


In [6]:
table.to_pandas()['contents'][0]

'<?xml version="1.0" encoding="UTF-8"?>\n<!--\nCopyright IBM Corp. and others 2006\n\nThis program and the accompanying materials are made available under\nthe terms of the Eclipse Public License 2.0 which accompanies this\ndistribution and is available at https://www.eclipse.org/legal/epl-2.0/\nor the Apache License, Version 2.0 which accompanies this distribution and\nis available at https://www.apache.org/licenses/LICENSE-2.0.\n\nThis Source Code may also be made available under the following\nSecondary Licenses when the conditions for such availability set\nforth in the Eclipse Public License, v. 2.0 are satisfied: GNU\nGeneral Public License, version 2 with the GNU Classpath\nException [1] and GNU General Public License, version 2 with the\nOpenJDK Assembly Exception [2].\n\n[1] https://www.gnu.org/software/classpath/license.html\n[2] https://openjdk.org/legal/assembly-exception.html\n\nSPDX-License-Identifier: EPL-2.0 OR Apache-2.0 OR GPL-2.0-only WITH Classpath-exception-2.0 OR 
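To check a whole column rather than a single row, the contents can be scanned for header keywords. The following is a rough sketch and not the transform's own detection logic; `rows_mentioning` is a hypothetical helper, and the keyword list and sample strings are assumptions chosen for illustration.

```python
from typing import Iterable


def rows_mentioning(contents: Iterable[str], keywords=("copyright", "license")) -> list[int]:
    """Return positions of rows whose text contains any keyword (case-insensitive)."""
    lowered = tuple(k.lower() for k in keywords)
    return [
        i for i, text in enumerate(contents)
        if any(k in str(text).lower() for k in lowered)
    ]


# Made-up sample rows, not taken from the notebook's data
sample = [
    "/*\n * Copyright 2018 Example Corp\n */\nint x = 1;",
    "#!/bin/bash\necho hello",
]
print(rows_mentioning(sample))  # → [0]
```

With the table loaded earlier, `rows_mentioning(table.to_pandas()['contents'])` would list the rows worth inspecting by hand. A keyword match is only a hint, since ordinary code can mention these words too.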

### Notes for Users and Developers
1. Ensure that your input files are placed in the specified input_folder path.
    * For sample input files, refer to the python/test-data/input folder.
2. Use the latest tagged version from PyPI for stability.
3. Customize the transform parameters as needed by updating the params dictionary.
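
For note 2, the installed release can be confirmed from inside the notebook with the standard library. A minimal sketch: `installed_version` is a hypothetical helper, and the distribution names are the ones used in the pip install cell.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


for name in ("data-prep-toolkit", "data-prep-toolkit-transforms"):
    print(name, installed_version(name) or "not installed")
```

Comparing the printed versions with the tag pinned in the install cell helps confirm that the kernel is using the intended release.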