In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here're several helpful packages to load

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Recommender System with TFX pipelines
This notebook builds MLOps components and pipelines using TFX for the recommender system [here](https://www.kaggle.com/code/nicholeasuniquename/recommender-systems/).

1. Create a virtual environment for TFX compatibility
2. Build the TFX components and upload to the public repository.
3. Download the components and test them in the virtual environment here.
4. EDA plots of raw data, joined data, and transformed data
5. Build the MLOps pipelines, upload to public repository.
6. Download the pipelines and test in the virtual environment here.


## 1. Creating the TFX compatible virtual environment

TFX version 1.16.0 is the latest stable release as of Sep 28, 2025.

It is compatible with Python 3.9 and 3.10 only.

The current Kaggle Python Docker image uses Python 3.11.13.
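
A quick way to confirm the mismatch from inside any interpreter is to compare its version against the supported set. This is a minimal sketch; the helper name `is_supported` is hypothetical, and the supported versions are simply the two listed above.

```python
import sys

# Assumption: supported minor versions are the ones stated above
# for TFX 1.16.0 (Python 3.9 and 3.10).
SUPPORTED_MINORS = {(3, 9), (3, 10)}


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) pair is in SUPPORTED_MINORS."""
    return tuple(version_info[:2]) in SUPPORTED_MINORS


print(
    f"Python {sys.version_info.major}.{sys.version_info.minor} "
    f"supported by TFX 1.16.0: {is_supported()}"
)
```

Run in the notebook kernel this should report False for Python 3.11; run with the Python 3.10 interpreter of the virtual environment it should report True.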

To use an earlier version of Python on Kaggle, one can install conda and create a virtual environment based on that version.

Once the environment exists, it is activated in two steps: first activate conda, then activate the virtual environment.

A %%bash magic cell runs as its own bash session that lasts only for that cell.
Because nothing carries over between cells, each %%bash cell must run the 2 activation commands before using the virtual environment.

Aside from %%bash cells, we can also run scripts with the Python subprocess library, as long as each command is prefixed with the 2 conda activation statements (see the definition of run_command below).

That gives us 2 ways to run commands within the virtual environment.

The notebook itself still runs in the Kaggle Docker image environment, not in the newly built virtual environment.
Even if we install ipykernel and register a kernel for the new virtual environment, I don't see a way to make the notebook use that kernel. (The Kaggle window has Session options, including a persistence option for files and variables, so it might be possible to restart the notebook with the new kernel selected, provided the kernel has Kaggle-specific notebook support.)

In summary, the notebook as is can be used for intermediate EDA steps whose libraries don't require an earlier version of Python. For MLOps steps that do need an earlier version, the virtual environment is available.


In [2]:
!pwd
!echo $HOME

/kaggle/working
/root


%%bash: Executes the entire cell as a shell script. 

In [3]:
%%bash
t0=$(date +%s%N)
mkdir -p ~/miniconda3
wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
#install conda and activate to /usr/local
bash ~/miniconda3/miniconda.sh -b -u -p /usr/local
rm ~/miniconda3/miniconda.sh
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

. /usr/local/bin/activate
echo "**$SHELL**"
echo "**$BASH**"
conda init --all

. /root/.bashrc
conda create -q --name my_tfx_env python=3.10 -y
conda activate my_tfx_env
python --version

t1=$(date +%s%N)
t2=$(echo "scale=9;($t1-$t0) / 1000000000" | bc)
echo $t2 seconds
date

PREFIX=/usr/local
Unpacking bootstrapper...
Unpacking payload...

Installing base environment...

Preparing transaction: ...working... done
Executing transaction: ...working... done
installation finished.
    You currently have a PYTHONPATH environment variable set. This may cause
    unexpected behavior when running the Python interpreter in Miniconda3.
    For best results, please verify that your PYTHONPATH only points to
    directories of packages that are compatible with the Python interpreter
    in Miniconda3: /usr/local
accepted Terms of Service for https://repo.anaconda.com/pkgs/main
accepted Terms of Service for https://repo.anaconda.com/pkgs/r
**/bin/bash**
**/usr/bin/bash**
no change     /usr/local/condabin/conda
no change     /usr/local/bin/conda
no change     /usr/local/bin/conda-env
no change     /usr/local/bin/activate
no change     /usr/local/bin/deactivate
no change     /usr/local/etc/profile.d/conda.sh
no change     /usr/local/etc/fish/conf.d/conda.fish
no change   

To activate the conda environment, first source conda's activate script (installed in /usr/local/bin above), then activate the conda virtual environment.

This has to be done in each %%bash magic cell.

In [4]:
%%bash
t0=$(date +%s%N)
. /usr/local/bin/activate
conda activate my_tfx_env
python --version

#consider conda install ipykernel
conda install pip

#conda config --add channels conda-forge
#conda config --set channel_priority strict
#conda install python-snappy
# or:
#conda install anaconda::python-snappy

#see dependencies https://github.com/tensorflow/transform
pip -q install pyarrow==10.0.1
pip -q install apache-beam==2.59.0
pip -q install tensorflow==2.16.1
pip -q install tensorflow-transform==1.16.0
pip -q install tfx==1.16.0
pip -q install tensorflow-data-validation==1.16.1
pip -q install pytest
# installs:
#tf metadata 1.16.1
#tfx-bsl 1.16.1
#arrow 1.3.0
#keeps protobuf 3.20.3

#The Spark runner currently supports Spark’s 3.2.x branch.
#Apache Beam Prism Runner. 


pip list
t1=$(date +%s%N)
t2=$(echo "scale=9;($t1-$t0) / 60000000000" | bc)
echo "$t2 minutes"
date
#about 6-7 minutes for this cell.

Python 3.10.18
2 channel Terms of Service accepted
Channels:
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.

Package                            Version
---------------------------------- --------------
absl-py                            1.4.0
annotated-types                    0.7.0
anyio                              4.11.0
apache-beam                        2.59.0
argon2-cffi                        25.1.0
argon2-cffi-bindings               25.1.0
arrow                              1.3.0
astunparse                         1.6.3
async-lru                          2.0.5
async-timeout                      5.0.1
attrs                              23.2.0
babel                              2.17.0
backcall                           0.2.0
beautifulsoup4                     4.14.2
bleach                             6.2.0
cachetools                         5.5.2
certifi                       



    current version: 25.7.0
    latest version: 25.9.1

Please update conda by running

    $ conda update -n base -c defaults conda


  DEPRECATION: Building 'crcmod' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'crcmod'. Discussion can be found at https://github.com/pypa/pip/issues/6334
  DEPRECATION: Building 'dill' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'dill'. Discussion can be foun

In [5]:
!java --version

openjdk 11.0.27 2025-04-15
OpenJDK Runtime Environment (build 11.0.27+6-post-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.27+6-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)


In [6]:
%%bash
. /usr/local/bin/activate
conda activate my_tfx_env
python --version
pip show apache-beam

#refresh the test dirs
rm -rf /kaggle/working/bin/*

Python 3.10.18
Name: apache-beam
Version: 2.59.0
Summary: Apache Beam SDK for Python
Home-page: https://beam.apache.org
Author: Apache Software Foundation
Author-email: dev@beam.apache.org
License: Apache License, Version 2.0
Location: /usr/local/envs/my_tfx_env/lib/python3.10/site-packages
Requires: cloudpickle, crcmod, dill, fastavro, fasteners, grpcio, hdfs, httplib2, js2py, jsonpickle, jsonschema, numpy, objsize, orjson, packaging, proto-plus, protobuf, pyarrow, pyarrow-hotfix, pydot, pymongo, python-dateutil, pytz, redis, regex, requests, typing-extensions, zstandard
Required-by: tensorflow-data-validation, tensorflow-transform, tensorflow_model_analysis, tfx, tfx-bsl
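
The pinned versions from the install cell can also be checked programmatically. The sketch below uses the standard-library `importlib.metadata`; the helper name `find_mismatches` is hypothetical, and the script only gives a meaningful answer when it is executed by the Python interpreter of my_tfx_env rather than by the notebook kernel.

```python
from importlib import metadata

# Pins copied from the pip install lines in the cell above.
PINS = {
    "pyarrow": "10.0.1",
    "apache-beam": "2.59.0",
    "tensorflow": "2.16.1",
    "tensorflow-transform": "1.16.0",
    "tfx": "1.16.0",
    "tensorflow-data-validation": "1.16.1",
}


def find_mismatches(pins):
    """Return {package: (expected, found)} for every pin that is not met."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # not installed in this interpreter
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every pin is satisfied in the interpreter that ran the script.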


The run_command function is from
https://www.kaggle.com/code/taylorsamarel/change-python-version-kaggle-v2-taylor-amarel

In [7]:
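import subprocess
def run_command(cmd, capture=True, check=False):
    """Run cmd in a shell after activating conda and my_tfx_env.

    When capture is True, return stdout (or stderr if stdout is empty).
    Otherwise return True if the command exited with status 0.
    If an exception is raised, return its message as a string.
    """
    #prepend the 2 activation statements that every new shell needs
    cmds = f". /usr/local/bin/activate; conda activate my_tfx_env; {cmd}"
    try:
        result = subprocess.run(cmds, shell=True, capture_output=capture, text=True, check=check)
        if capture:
            return result.stdout.strip() if result.stdout else result.stderr.strip()
        return result.returncode == 0
    except Exception as e:
        return str(e)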

In [8]:
print(run_command("python --version"))

Python 3.10.18
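
Because run_command returns either stdout or stderr as a single string, a failed command can be hard to tell apart from a successful one. A possible variant, sketched below, returns the exit code and both streams separately. The names `run_in_env` and `CommandResult` are hypothetical and not part of the referenced notebook; the activation prefix is the same one used in run_command.

```python
import subprocess
from typing import NamedTuple


class CommandResult(NamedTuple):
    returncode: int
    stdout: str
    stderr: str


# Same 2 activation statements that run_command prepends.
ACTIVATE = ". /usr/local/bin/activate; conda activate my_tfx_env; "


def run_in_env(cmd, prefix=ACTIVATE):
    """Run cmd in a shell after the activation prefix and keep both streams."""
    completed = subprocess.run(
        prefix + cmd, shell=True, capture_output=True, text=True
    )
    return CommandResult(
        completed.returncode, completed.stdout.strip(), completed.stderr.strip()
    )
```

For example, `run_in_env("python --version").returncode` can be checked before trusting the output. Passing `prefix=""` runs the command in the notebook's own environment, which is handy for comparing the two.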


### 1.a. Download a TFX test script and test that the library versions are compatible

In [9]:
%%bash
. /usr/local/bin/activate
conda activate my_tfx_env

#it can take a couple of minutes to get current version of recently uploaded file to github
#wget -q -c --no-cache https://raw.githubusercontent.com/nking/recommender_systems/refs/heads/main/src/test/python/test_tft.py -O /kaggle/working/test_tft.py
#curl --header "Cache-Control: no-cache" "https://api.github.com/repos/nking/recommender_systems/content/src/test/python/test_tft.py" -o /kaggle/working/test_tft.py

rm -f /kaggle/working/dataset_tfxio_example.py
wget -q -c --no-cache https://raw.githubusercontent.com/nking/recommender_systems/refs/heads/main/src/test/python/dataset_tfxio_example.py -O /kaggle/working/dataset_tfxio_example.py

ls -l /kaggle/working

#run a test example from Google's TFX codebase:
python3 /kaggle/working/dataset_tfxio_example.py

date

total 4
-rw-r--r-- 1 root root 2392 Oct 16 21:45 dataset_tfxio_example.py
{'x_centered': [[-4.0], [-3.0], [-2.0], [-1.0], [0.0]],
 'x_scaled': [[0.0], [0.125], [0.25], [0.375], [0.5]]}
{'x_centered': [[1.0], [2.0], [3.0], [4.0]],
 'x_scaled': [[0.625], [0.75], [0.875], [1.0]]}
Thu Oct 16 09:45:21 PM UTC 2025


I1016 21:45:16.418321 138325445830464 pipeline.py:197] Missing pipeline option (runner). Executing pipeline using the default runner: DirectRunner.
I1016 21:45:18.620950 138325445830464 statecache.py:214] Creating state cache with size 104857600
I1016 21:45:18.850756 138325445830464 functional_saver.py:438] Sharding callback duration: 8
I1016 21:45:18.890327 138325445830464 functional_saver.py:438] Sharding callback duration: 8
INFO:tensorflow:Assets written to: /tmp/tmp3t77yz6b/tftransform_tmp/3d4d747a263e4a678bd82bda78b38685/assets
I1016 21:45:18.924497 138325445830464 builder_impl.py:829] Assets written to: /tmp/tmp3t77yz6b/tftransform_tmp/3d4d747a263e4a678bd82bda78b38685/assets
I1016 21:45:18.926936 138325445830464 fingerprinting_utils.py:49] Writing fingerprint to /tmp/tmp3t77yz6b/tftransform_tmp/3d4d747a263e4a678bd82bda78b38685/fingerprint.pb
INFO:tensorflow:struct2tensor is not available.
I1016 21:45:19.316430 138325445830464 saved_transform_io.py:166] struct2tensor is not avail

### 1.b. Download a MovieLens dataset

In [10]:
%%bash
wget -q http://files.grouplens.org/datasets/movielens/ml-1m.zip -O /kaggle/working/ml-1m.zip
unzip -o /kaggle/working/ml-1m.zip
ls /kaggle/working/ml-1m/
rm /kaggle/working/ml-1m.zip

head -n 5 /kaggle/working/ml-1m/ratings.dat
head -n 5 /kaggle/working/ml-1m/users.dat
head -n 5 /kaggle/working/ml-1m/movies.dat

#making small subsets for tests
head -n 1000 /kaggle/working/ml-1m/ratings.dat > /kaggle/working/ml-1m/ratings_1000.dat
head -n 100 /kaggle/working/ml-1m/users.dat > /kaggle/working/ml-1m/users_100.dat
    

Archive:  /kaggle/working/ml-1m.zip
   creating: ml-1m/
  inflating: ml-1m/movies.dat        
  inflating: ml-1m/ratings.dat       
  inflating: ml-1m/README            
  inflating: ml-1m/users.dat         
movies.dat
ratings.dat
README
users.dat
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
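
The three files share the `::` delimiter shown in the output above. As an illustration of the record layout (not the Beam code used later), one line can be parsed in plain Python like this. The field names follow the schemas used in the EDA section; the helper `parse_line` is hypothetical.

```python
# Field names and converters per file, in column order.
RATING_FIELDS = (("user_id", int), ("movie_id", int),
                 ("rating", int), ("timestamp", int))
USER_FIELDS = (("user_id", int), ("gender", str), ("age", int),
               ("occupation", int), ("zipcode", str))
MOVIE_FIELDS = (("movie_id", int), ("title", str),
                ("genres", lambda s: s.split("|")))


def parse_line(line, fields):
    """Split one '::'-delimited line and convert each value."""
    values = line.rstrip("\n").split("::")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return {name: convert(value) for (name, convert), value in zip(fields, values)}


print(parse_line("1::1193::5::978300760", RATING_FIELDS))
print(parse_line("4::M::45::7::02460", USER_FIELDS))
print(parse_line("1::Toy Story (1995)::Animation|Children's|Comedy", MOVIE_FIELDS))
```

Keeping zipcode as a string preserves leading zeros such as "02460".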


## 2. Write the TFX components, Beam PTransforms and unit tests

Write them and upload them to a reachable repository.
If the repository is private, you can use Kaggle secrets to hold API keys, etc. for use in the downloads below.
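
For a private repository, a download could look roughly like the sketch below. `UserSecretsClient` is the Kaggle secrets client available inside a Kaggle session. Everything else is an assumption for illustration: the secret label GITHUB_TOKEN, the helper names, and the `Authorization: token ...` header scheme, which should be checked against the hosting service's documentation.

```python
import urllib.request


def get_token(secret_name="GITHUB_TOKEN"):
    """Read a secret attached to the notebook (works only inside Kaggle).

    The label GITHUB_TOKEN is a hypothetical name chosen for this sketch.
    """
    from kaggle_secrets import UserSecretsClient  # provided by the Kaggle image
    return UserSecretsClient().get_secret(secret_name)


def build_request(url, token):
    """Build a request that carries the token in an Authorization header.

    Assumption: the host accepts the 'token <value>' scheme.
    """
    return urllib.request.Request(url, headers={"Authorization": f"token {token}"})


def download(url, dest_path, token):
    """Fetch url with the token and write the response body to dest_path."""
    with urllib.request.urlopen(build_request(url, token)) as response:
        with open(dest_path, "wb") as out:
            out.write(response.read())
```

Reading the token at run time keeps it out of the saved notebook and its outputs.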

## 3. Download the components and transforms

### 3.a. Ingestion

The first component is ingestion. One version was implemented as a Python function-based custom component (a decorated function) and another as a fully custom component. Both use mostly the same custom Apache Beam PTransforms.
The Python function-based component is the preferred one to use in the pipeline, though both produce similar results.

Customization was needed to ingest the 3 files ("ratings.dat", "movies.dat", "users.dat"), left join users and movies onto ratings, and then split the result.
The components are called IngestMovieLensComponent and ingest_movie_lens_component for the fully custom and the Python function-based versions, respectively.
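
To make the join-then-split idea concrete, here is a small pure-Python sketch. It is not the Beam implementation in the repository: the function names are hypothetical, the 8:1:1 split weights are an assumption, and the CRC32 bucketing is just one deterministic way to assign rows to splits.

```python
import zlib


def left_join_ratings(ratings, users, movies):
    """Left join: keep every rating, fill None where a user or movie is missing."""
    users_by_id = {u["user_id"]: u for u in users}
    movies_by_id = {m["movie_id"]: m for m in movies}
    joined = []
    for rating in ratings:
        user = users_by_id.get(rating["user_id"], {})
        movie = movies_by_id.get(rating["movie_id"], {})
        joined.append({
            **rating,
            "gender": user.get("gender"),
            "age": user.get("age"),
            "occupation": user.get("occupation"),
            "genres": movie.get("genres"),
        })
    return joined


def assign_split(key, weights=(("train", 8), ("eval", 1), ("test", 1))):
    """Map a string key to a split name by hashing it into weighted buckets."""
    total = sum(weight for _, weight in weights)
    bucket = zlib.crc32(key.encode("utf-8")) % total
    upper = 0
    for name, weight in weights:
        upper += weight
        if bucket < upper:
            return name
    raise ValueError("weights must be positive")
```

Hashing a stable key (for example user_id and movie_id joined into one string) means a row lands in the same split on every run.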

The code base also contains ingestion to pyarrow data structures and reads and writes of Parquet files for other uses.

The code base is on GitHub (user nking, repository recommender_systems), and the next cell downloads the code and unit tests.

In [47]:
%%bash
#it can take a couple of minutes to get current version of recently uploaded file to github
#wget -q -c --no-cache https://raw.githubusercontent.com/nking/recommender_systems/refs/heads/main/src/test/python/test_tft.py -O /kaggle/working/test_tft.py
#curl --header "Cache-Control: no-cache" "https://api.github.com/repos/nking/recommender_systems/content/src/test/python/test_tft.py" -o /kaggle/working/test_tft.py

repo_uri='https://raw.githubusercontent.com/nking/recommender_systems/refs/heads/development/src/main/python'
declare -a my_files=("ingest_movie_lens_beam.py" "ingest_movie_lens_beam_pa.py"
  "CustomUTF8Coder.py" "ingest_movie_lens_component.py" 
  "movie_lens_utils.py" "ingest_movie_lens_custom_component.py"
  "transform_movie_lens.py"
)
for item in "${my_files[@]}"
do
  rm -f "/kaggle/working/$item"
  echo "$item"
  wget -q -c --no-cache "$repo_uri/$item" -O /kaggle/working/$item
done

#repo_uri='https://raw.githubusercontent.com/nking/recommender_systems/refs/heads/development/src/drafts/python'
#declare -a my_files=()
#for item in "${my_files[@]}"
#do
#  rm -f "/kaggle/working/$item"
#  echo "$item"
#  wget -q -c --no-cache "$repo_uri/$item" -O /kaggle/working/$item
#done

repo_uri='https://raw.githubusercontent.com/nking/recommender_systems/refs/heads/development/src/test/python'
declare -a my_files=("ingest_movie_lens_beam_test.py" "ingest_movie_lens_beam_pa_test.py"
  "ingest_movie_lens_component_test.py" "ingest_movie_lens_custom_component_test.py" 
  "movie_lens_utils_test.py" "csv_example_gen_test.py" 
  "helper.py" "transform_movie_lens_test.py"
)
for item in "${my_files[@]}"
do
  rm -f "/kaggle/working/$item"
  echo "$item"
  wget -q -c --no-cache "$repo_uri/$item" -O /kaggle/working/$item
done

ls -l /kaggle/working/
date

ingest_movie_lens_beam.py
ingest_movie_lens_beam_pa.py
CustomUTF8Coder.py
ingest_movie_lens_component.py
movie_lens_utils.py
ingest_movie_lens_custom_component.py
transform_movie_lens.py
ingest_movie_lens_beam_test.py
ingest_movie_lens_beam_pa_test.py
ingest_movie_lens_component_test.py
ingest_movie_lens_custom_component_test.py
movie_lens_utils_test.py
csv_example_gen_test.py
helper.py
transform_movie_lens_test.py
total 164
drwxr-xr-x 7 root root  4096 Oct 16 21:48 bin
-rw-r--r-- 1 root root  7373 Oct 16 23:21 csv_example_gen_test.py
-rw-r--r-- 1 root root   777 Oct 16 23:21 CustomUTF8Coder.py
-rw-r--r-- 1 root root  2392 Oct 16 21:45 dataset_tfxio_example.py
-rw-r--r-- 1 root root  3633 Oct 16 23:21 helper.py
-rw-r--r-- 1 root root 10031 Oct 16 23:21 ingest_movie_lens_beam_pa.py
-rw-r--r-- 1 root root  4581 Oct 16 23:21 ingest_movie_lens_beam_pa_test.py
-rw-r--r-- 1 root root  7942 Oct 16 23:21 ingest_movie_lens_beam.py
-rw-r--r-- 1 root root  5111 Oct 16 23:21 ingest_movie_lens_beam

### Run the unit tests

In [12]:
%%bash

. /usr/local/bin/activate
conda activate my_tfx_env

python --version

echo "run test for CSVExampleGen"

t0=$(date +%s%N)

python -m unittest /kaggle/working/csv_example_gen_test.py

t1=$(date +%s%N)
t2=$(echo "scale=9;($t1-$t0) / 1000000000" | bc)
echo $t2 seconds
date

Python 3.10.18
run test for CSVExampleGen
TensorFlow version: 2.16.1
TFX version: 1.16.0
key=examples, value=OutputChannel(artifact_type=Examples, producer_component_id=CsvExampleGen, output_key=examples, additional_properties={}, additional_custom_properties={}, _input_trigger=None, _is_async=False)
listing files in output_data_dir /kaggle/working/bin/csv_comp_1/testRun:
/kaggle/working/bin/csv_comp_1/testRun/test_csvgenexample/CsvExampleGen/examples/1/Split-train/data_tfrecord-00000-of-00001.gz
/kaggle/working/bin/csv_comp_1/testRun/test_csvgenexample/CsvExampleGen/examples/1/Split-eval/data_tfrecord-00000-of-00001.gz
/kaggle/working/bin/csv_comp_1/testRun/test_csvgenexample/CsvExampleGen/examples/1/Split-test/data_tfrecord-00000-of-00001.gz
15.427272856 seconds
Thu Oct 16 09:45:42 PM UTC 2025


.s
----------------------------------------------------------------------
Ran 2 tests in 3.362s

OK (skipped=1)


In [13]:
%%bash
head -n 5 /kaggle/working/ml-1m/ratings.dat
head -n 5 /kaggle/working/ml-1m/users.dat
head -n 5 /kaggle/working/ml-1m/movies.dat

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy


In [14]:
!find /kaggle/working -type f


/kaggle/working/ingest_movie_lens_custom_component_test.py
/kaggle/working/ingest_movie_lens_component_test.py
/kaggle/working/__pycache__/CustomUTF8Coder.cpython-310.pyc
/kaggle/working/__pycache__/csv_example_gen_test.cpython-310.pyc
/kaggle/working/__pycache__/ingest_movie_lens_custom_component.cpython-310.pyc
/kaggle/working/__pycache__/ingest_movie_lens_beam.cpython-310.pyc
/kaggle/working/__pycache__/movie_lens_utils.cpython-310.pyc
/kaggle/working/ingest_movie_lens_beam_test.py
/kaggle/working/ml-1m/users.dat
/kaggle/working/ml-1m/users_100.dat
/kaggle/working/ml-1m/README
/kaggle/working/ml-1m/tmp/users2.dat
/kaggle/working/ml-1m/movies.dat
/kaggle/working/ml-1m/ratings_1000.dat
/kaggle/working/ml-1m/ratings.dat
/kaggle/working/bin/csv_comp_1/testRun/test_csvgenexample/CsvExampleGen/examples/1/Split-train/data_tfrecord-00000-of-00001.gz
/kaggle/working/bin/csv_comp_1/testRun/test_csvgenexample/CsvExampleGen/examples/1/Split-eval/data_tfrecord-00000-of-00001.gz
/kaggle/working/b

In [15]:
%%bash

. /usr/local/bin/activate
conda activate my_tfx_env

python --version

echo "run test for utils methods"

t0=$(date +%s%N)

python -m unittest /kaggle/working/movie_lens_utils_test.py

t1=$(date +%s%N)
t2=$(echo "scale=9;($t1-$t0) / 1000000000" | bc)
echo $t2 seconds
date

Python 3.10.18
run test for utils methods
5.496156890 seconds
Thu Oct 16 09:45:50 PM UTC 2025


...
----------------------------------------------------------------------
Ran 3 tests in 0.001s

OK


In [16]:
%%bash

. /usr/local/bin/activate
conda activate my_tfx_env

python --version

echo "run test for beam transforms"

t0=$(date +%s%N)

python -m unittest /kaggle/working/ingest_movie_lens_beam_test.py

t1=$(date +%s%N)
t2=$(echo "scale=9;($t1-$t0) / 1000000000" | bc)
echo $t2 seconds
date

Python 3.10.18
run test for beam transforms
88.289876538 seconds
Thu Oct 16 09:47:20 PM UTC 2025


.s
----------------------------------------------------------------------
Ran 2 tests in 81.314s

OK (skipped=1)


In [17]:
%%bash

. /usr/local/bin/activate
conda activate my_tfx_env

python --version

echo "run test for tfx python function custom component"

t0=$(date +%s%N)

python -m unittest /kaggle/working/ingest_movie_lens_component_test.py

t1=$(date +%s%N)
t2=$(echo "scale=9;($t1-$t0) / 1000000000" | bc)
echo $t2 seconds
date


Python 3.10.18
run test for tfx python function custom component
TensorFlow version: 2.16.1
TFX version: 1.16.0
23.616623388 seconds
Thu Oct 16 09:47:45 PM UTC 2025


.s
----------------------------------------------------------------------
Ran 2 tests in 11.335s

OK (skipped=1)


In [18]:
!find /kaggle/working -type f


/kaggle/working/ingest_movie_lens_custom_component_test.py
/kaggle/working/ingest_movie_lens_component_test.py
/kaggle/working/__pycache__/ingest_movie_lens_component_test.cpython-310.pyc
/kaggle/working/__pycache__/CustomUTF8Coder.cpython-310.pyc
/kaggle/working/__pycache__/ingest_movie_lens_component.cpython-310.pyc
/kaggle/working/__pycache__/csv_example_gen_test.cpython-310.pyc
/kaggle/working/__pycache__/helper.cpython-310.pyc
/kaggle/working/__pycache__/movie_lens_utils_test.cpython-310.pyc
/kaggle/working/__pycache__/ingest_movie_lens_beam_test.cpython-310.pyc
/kaggle/working/__pycache__/ingest_movie_lens_custom_component.cpython-310.pyc
/kaggle/working/__pycache__/ingest_movie_lens_beam.cpython-310.pyc
/kaggle/working/__pycache__/movie_lens_utils.cpython-310.pyc
/kaggle/working/ingest_movie_lens_beam_test.py
/kaggle/working/ml-1m/users.dat
/kaggle/working/ml-1m/users_100.dat
/kaggle/working/ml-1m/README
/kaggle/working/ml-1m/tmp/users2.dat
/kaggle/working/ml-1m/movies.dat
/kagg

In [19]:
%%bash

. /usr/local/bin/activate
conda activate my_tfx_env

python --version

echo "run test for TFX fully custom component"

t0=$(date +%s%N)

python -m unittest /kaggle/working/ingest_movie_lens_custom_component_test.py

t1=$(date +%s%N)
t2=$(echo "scale=9;($t1-$t0) / 1000000000" | bc)
echo $t2 seconds
date

Python 3.10.18
run test for TFX fully custom component
TensorFlow version: 2.16.1
TFX version: 1.16.0
26.208010227 seconds
Thu Oct 16 09:48:14 PM UTC 2025


.s
----------------------------------------------------------------------
Ran 3 tests in 14.161s

OK (skipped=1)


In [48]:
%%bash

. /usr/local/bin/activate
conda activate my_tfx_env

python --version

echo "run test for preprocessing Transform"

t0=$(date +%s%N)

python -m unittest /kaggle/working/transform_movie_lens_test.py

t1=$(date +%s%N)
t2=$(echo "scale=9;($t1-$t0) / 1000000000" | bc)
echo $t2 seconds
date

Python 3.10.18
run test for preprocessing Transform
running bdist_wheel
running build
running build_py
creating build/lib
copying ingest_movie_lens_custom_component_test.py -> build/lib
copying ingest_movie_lens_component_test.py -> build/lib
copying ingest_movie_lens_beam_test.py -> build/lib
copying ingest_movie_lens_beam.py -> build/lib
copying transform_movie_lens_test.py -> build/lib
copying transform_movie_lens.py -> build/lib
copying helper.py -> build/lib
copying ingest_movie_lens_beam_pa_test.py -> build/lib
copying csv_example_gen_test.py -> build/lib
copying CustomUTF8Coder.py -> build/lib
copying ingest_movie_lens_beam_pa.py -> build/lib
copying ingest_movie_lens_custom_component.py -> build/lib
copying ingest_movie_lens_component.py -> build/lib
copying dataset_tfxio_example.py -> build/lib
copying movie_lens_utils_test.py -> build/lib
copying movie_lens_utils.py -> build/lib
installing to /tmp/tmprhlikoh2
running install
running install_lib
copying build/lib/helper.py -> 

INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Generating ephemeral wheel package for '/kaggle/working/transform_movie_lens.py' (including modules: ['ingest_movie_lens_custom_component_test', 'ingest_movie_lens_component_test', 'ingest_movie_lens_beam_test', 'ingest_movie_lens_beam', 'transform_movie_lens_test', 'transform_movie_lens', 'helper', 'ingest_movie_lens_beam_pa_test', 'csv_example_gen_test', 'CustomUTF8Coder', 'ingest_movie_lens_beam_pa', 'ingest_movie_lens_custom_component', 'ingest_movie_lens_component', 'dataset_tfxio_example', 'movie_lens_utils_test', 'movie_lens_utils']).
INFO:absl:User module package has hash fingerprint version 0b7b064cb9dcb4637476bc7cd0a8c78632c3ca089b039f2059496b531afa4162.
INFO:absl:Executing: ['/usr/local/envs/my_tfx_env/bin/python', '/tmp/tmpxqph3v13/_tfx_generated_setup.py', 'bdist_wheel', '--bdist-dir', '/tmp/tmprhlikoh2', '--dist-dir', '/tmp/tmpg0y6ti9

In [21]:
!find /kaggle/working -type f

/kaggle/working/ingest_movie_lens_custom_component_test.py
/kaggle/working/ingest_movie_lens_component_test.py
/kaggle/working/__pycache__/ingest_movie_lens_component_test.cpython-310.pyc
/kaggle/working/__pycache__/CustomUTF8Coder.cpython-310.pyc
/kaggle/working/__pycache__/ingest_movie_lens_component.cpython-310.pyc
/kaggle/working/__pycache__/csv_example_gen_test.cpython-310.pyc
/kaggle/working/__pycache__/helper.cpython-310.pyc
/kaggle/working/__pycache__/transform_movie_lens.cpython-310.pyc
/kaggle/working/__pycache__/movie_lens_utils_test.cpython-310.pyc
/kaggle/working/__pycache__/ingest_movie_lens_beam_test.cpython-310.pyc
/kaggle/working/__pycache__/ingest_movie_lens_custom_component.cpython-310.pyc
/kaggle/working/__pycache__/ingest_movie_lens_beam.cpython-310.pyc
/kaggle/working/__pycache__/ingest_movie_lens_custom_component_test.cpython-310.pyc
/kaggle/working/__pycache__/transform_movie_lens_test.cpython-310.pyc
/kaggle/working/__pycache__/movie_lens_utils.cpython-310.pyc
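
The unit-test cells above repeat the same activate, run, and time pattern. As a possible consolidation, the sketch below runs a list of shell commands and records whether each succeeded and how long it took. The helper name `run_timed` is hypothetical.

```python
import subprocess
import time


def run_timed(commands, prefix=""):
    """Run each shell command; return a list of (command, succeeded, seconds)."""
    results = []
    for cmd in commands:
        start = time.perf_counter()
        completed = subprocess.run(
            prefix + cmd, shell=True, capture_output=True, text=True
        )
        elapsed = time.perf_counter() - start
        results.append((cmd, completed.returncode == 0, elapsed))
    return results


# Example for the Kaggle session, using test files downloaded above:
# prefix = ". /usr/local/bin/activate; conda activate my_tfx_env; "
# tests = ["movie_lens_utils_test.py", "ingest_movie_lens_beam_test.py"]
# cmds = [f"python -m unittest /kaggle/working/{t}" for t in tests]
# for cmd, ok, seconds in run_timed(cmds, prefix):
#     print(f"{'OK  ' if ok else 'FAIL'} {seconds:8.2f}s  {cmd}")
```

The trade-off is that per-test output is captured rather than streamed, so the separate cells remain useful when a test needs debugging.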


## 4. EDA

Install and import polars, matplotlib, seaborn, dcor, scipy.stats, and tensorflow.

In [None]:
!pip install -q tensorflow
!pip install -q polars[all]
!pip install -q -U "seaborn>=0.13.0"
!pip install -q matplotlib
!pip install dcor

#restart the kernel
from IPython import get_ipython
ipython = get_ipython()
if ipython is not None:
    ipython.kernel.do_shutdown(restart=True)
    print("Jupyter kernel is restarting...")
else:
    print("Not running in an IPython/Jupyter environment.")

In [None]:
import tensorflow as tf
import polars as pl
import matplotlib.pyplot as plt
#the preinstalled seaborn is 0.12.2; plotting polars DataFrames needs seaborn >= 0.13.0
import seaborn as sns
from scipy.stats.distributions import chi2

pl.Config.set_fmt_str_lengths(900)

print(sns.__version__)

### 4.a. EDA on raw data

In [23]:
#https://pmc.ncbi.nlm.nih.gov/articles/PMC9191842/#:~:text=The%20chi%2Dsquare%20test%20is%20well%20behaved%20from%20the%20following,%E2%88%92%201%20(%201%20%E2%88%92%20%CE%B1%20)
#The Chi-Square Test of Distance Correlation

# a distance covariance-based chi-square test to test for independence 
# between two variables by calculating the bias-corrected sample 
# distance correlation and comparing it to a chi-square distribution. 
# handles both continuous and categorical data

import numpy as np
import dcor
def can_reject_indep(x : np.ndarray, y:np.ndarray, alpha:float = 0.05, debug:bool=False):
  """
  Reject independence when
    n*C >= inv(F{chi^2-1})(1-alpha)
  where n = len(x),
    C = fast distance covariance following Chaudhuri and Hu (2019),
    inv(F{chi^2-1}) is the inverse of the CDF.
  Returns True if independence can be rejected at level alpha.
  """
  with np.errstate(divide='ignore'):
    C = dcor.distance_covariance(x, y, method='mergesort')
  lhs = len(x)*C
  rhs = chi2.ppf(1-alpha, df=x.shape[-1])
  if debug:
    print(f"nC={lhs}\nppf(1-{alpha}, dof={x.shape[-1]})={rhs}")
  return lhs >= rhs

ModuleNotFoundError: No module named 'dcor'
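
For reference, the threshold named in the docstring can be worked out with the standard library alone. The sketch assumes that "chi^2-1" means a chi-square variable with 1 degree of freedom shifted down by 1, and it uses the fact that the square of a standard normal variable is chi-square with 1 degree of freedom. The function names are hypothetical, and the cell above computes its threshold differently (chi2.ppf with df=x.shape[-1] and no shift), so this illustrates the rule as written rather than replacing that code.

```python
from statistics import NormalDist


def chi2_1_quantile(p):
    """Quantile of chi-square with 1 d.o.f.

    If Z ~ N(0, 1) then Z**2 ~ chi-square(1), so the p-quantile is
    the square of the standard normal quantile at (1 + p) / 2.
    """
    return NormalDist().inv_cdf((1.0 + p) / 2.0) ** 2


def shifted_threshold(alpha=0.05):
    """Threshold inv(F{chi^2-1})(1-alpha) under the shift-by-1 assumption."""
    return chi2_1_quantile(1.0 - alpha) - 1.0


print(f"chi-square(1) quantile at 0.95: {chi2_1_quantile(0.95):.4f}")
print(f"shifted threshold at alpha=0.05: {shifted_threshold(0.05):.4f}")
```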

In [None]:
#EDA for raw data
from collections import OrderedDict
import re
import io
from datetime import datetime
import pytz

print(sns.__version__)

CTZ = pytz.timezone("America/Chicago")
genres = ["Action", "Adventure", "Animation", "Children", "Comedy",
          "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir",
          "Horror", "Musical", "Mystery", "Romance", "Sci-Fi",
          "Thriller", "War", "Western"]

schemas = {
  'ratings' : pl.Schema(OrderedDict({'user_id': pl.Int64, 
    'movie_id': pl.Int64, 'rating': pl.Int64,
    'timestamp' : pl.Int64})),
  'users' : pl.Schema(OrderedDict({'user_id': pl.Int64, 
    'gender': pl.String, 'age': pl.Int64,
    'occupation' : pl.Int64, 
    'zipcode' : pl.String})),
  'movies' : pl.Schema(OrderedDict({'movie_id': pl.Int64, 
    'title': pl.String, 'genres': pl.String}))}

file_paths = {
  'ratings':'/kaggle/working/ml-1m/ratings.dat',
  'users':'/kaggle/working/ml-1m/users.dat',
  'movies':'/kaggle/working/ml-1m/movies.dat'
}

#polars.read_csv( source=
#  encoding='iso-8859-1', 
#  has_header=False, skip_rows=0, try_parse_dates=True, 
#  use_pyarrow=True

labels_dict = {}
labels_dict['age_group'] = {0:'1', 1:'18', 2:'25', 3:'35', 4:'45', 5:'50', 6:'56'} 
labels_dict['gender'] = {0:'F', 1:'M'}
labels_dict['occupation'] = {0:  "other", 1:  "academic/educator", 2:  "artist", 3:  "clerical/admin", 4:  "college/grad student", 5:  "customer service", \
    6:  "doctor/health care", 7:  "executive/managerial", 8:  "farmer", 9:  "homemaker", 10:  "K-12 student", 11:  "lawyer", 12:  "programmer", \
    13:  "retired", 14:  "sales/marketing", 15:  "scientist", 16:  "self-employed", 17:  "technician/engineer", 18:  "tradesman/craftsman", \
    19:  "unemployed", 20:  "writer"}
labels_dict_arrays = {}
for k in labels_dict:
    labels_dict_arrays[k]=[labels_dict[k][k2] for k2 in labels_dict[k]]

for key in file_paths:
    processed_buffer = io.StringIO()
    file_path = file_paths[key]
    schema = schemas[key]
    print(f"key={key}, file_path={file_path}")
    with open(file_path, "r", encoding='iso-8859-1') as file:
        for line in file:
            line2 = line.replace('::', '\t')
            processed_buffer.write(line2)
    
    processed_buffer.seek(0)
    df = pl.read_csv(processed_buffer,\
        encoding='iso-8859-1', has_header=False, \
        skip_rows=0, separator='\t', schema=schema,\
        try_parse_dates=True, \
        new_columns=schema.names(), \
        use_pyarrow=True)

    if key=="movies":
        df = df.with_columns(
          pl.col("genres").str.replace("Children's", "Children")
        )
        df = df.with_columns(
          pl.col("genres").str.split("|")
        )
        movie_genres = df.explode('genres')
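        #polars value_counts returns a DataFrame (columns: genres, count), not a pandas-style index
        ordered = movie_genres['genres'].value_counts(sort=True)['genres'].to_list()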
        sns.catplot(data=movie_genres, y="genres",  
          kind="count", order=ordered).set(title='Movie genres')
        plt.show()
                  
    if key=="ratings":
        #user_id, movie_id, rating, timestamp
        g = sns.catplot(data=df, x='rating',  kind="count").set(title='rating')
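        #convert epoch seconds to local time with polars expressions;
        #a polars DataFrame does not support df["col"] = ... assignment
        df = df.with_columns(
          pl.from_epoch("timestamp", time_unit="s")
            .dt.replace_time_zone("UTC")
            .dt.convert_time_zone("America/Chicago")
            .alias("local_time"))
        df = df.with_columns(
          (pl.col("local_time").dt.hour() + pl.col("local_time").dt.minute() / 60.)
            .round().cast(pl.Int64).alias("hr"),
          #polars weekday is 1..7 (Monday=1); subtract 1 to match datetime.weekday()
          (pl.col("local_time").dt.weekday().cast(pl.Int64) - 1).alias("weekday"))
        df = df.with_columns(
          (pl.col("hr") * 7 + pl.col("weekday")).alias("hr_wk"))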
        g = sns.catplot(data=df, x='hr',  kind="count").set(title='hr of day')
        g = sns.catplot(data=df, x='weekday',  kind="count").set(title='weekday')
        g = sns.catplot(data=df, x='hr_wk',  kind="count").set(title='hr of weekday')
        plt.show()
        x = df.select(pl.col("rating")).to_numpy()
        y = df.select(pl.col("hr_wk")).to_numpy()
        print(f"rating, hr_wk are indep: {can_reject_indep(x,y,0.05,True)}")

    if key=="users":
        #user_id, gender, age, occupation, zipcode
        # Convert once to pandas, which seaborn accepts directly
        users_pd = df.to_pandas()
        g = sns.catplot(data=users_pd, x='gender',  kind="count").set(title='gender')
        g = sns.catplot(data=users_pd, x='age',  kind="count").set(title='age')
        g = sns.catplot(data=users_pd, x='occupation',  kind="count").set(title='occupation')
        plt.show()
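
The `hr`, `weekday`, and `hr_wk` features derived for the ratings can be checked on a single timestamp with the standard library. This is only a sketch: the helper name `time_features` is hypothetical, and it defaults to UTC as an assumption, whereas the notebook uses its own `CTZ` time zone.

```python
from datetime import datetime, timezone

def time_features(timestamp, tz=timezone.utc):
    """Return (hr, weekday, hr_wk) for a Unix timestamp in seconds.

    tz defaults to UTC here as an assumption; the notebook uses CTZ.
    """
    local_time = datetime.fromtimestamp(timestamp, tz=tz)
    # Hour of day rounded to the nearest hour; 23:30 and later rounds up to 24.
    hr = int(round(local_time.hour + local_time.minute / 60.0))
    # Monday is 0 and Sunday is 6.
    weekday = local_time.weekday()
    # Combine both into one categorical code: 7 slots per hour value.
    hr_wk = hr * 7 + weekday
    return hr, weekday, hr_wk

# 978300760 is 2000-12-31 22:12:40 UTC, a Sunday.
print(time_features(978300760))  # (22, 6, 160)
```

Because of the rounding, `hr` ranges over 0..24 rather than 0..23, so `hr_wk` can reach 174.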
        


### 4.b. EDA on raw left-joined data, split into train, eval, test

In [None]:
#enable/disable by toggling from Markdown to Code
#"""

#TODO: finish here

print(f"tf.executing_eagerly()={tf.executing_eagerly()}")

tfrecord_path = "/kaggle/working/bin/py_custom_comp_1/test_MovieLensExampleGen/TestPythonFuncCustomCompPipeline/MovieLensExampleGen/output_examples/1/Split-train/data_tfrecord-00000-of-00001.tfrecord"

def get_expected_col_name_feature_types():
  return {"user_id": tf.io.FixedLenFeature([], tf.int64),
    "movie_id":tf.io.FixedLenFeature([], tf.int64),
    "rating" : tf.io.FixedLenFeature([], tf.int64),
    "timestamp" : tf.io.FixedLenFeature([], tf.int64),
    "gender" : tf.io.FixedLenFeature([], tf.string),
    "age" : tf.io.FixedLenFeature([], tf.int64), 
    "occupation" : tf.io.FixedLenFeature([], tf.int64),
    "genres" : tf.io.FixedLenFeature([], tf.string)}

col_name_feature_types = get_expected_col_name_feature_types()

def _parse_function(example_proto):
  return tf.io.parse_single_example(example_proto, col_name_feature_types)

dataset = tf.data.TFRecordDataset(tfrecord_path)
parsed_dataset = dataset.map(_parse_function)

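# Sanity check: confirm that one record parses with the expected feature spec
try:
  for parsed_example in parsed_dataset.take(1):
    pass
except Exception as e:
  print(e)

# Sanity check: confirm that one raw record deserializes as a tf.train.Example
try:
  for tfrecord in dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(tfrecord.numpy())
    #print(f"EXAMPLE={example}")
except Exception as e:
  print(e)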

data_list = []
for record in parsed_dataset:
    row = {
        'user_id': record['user_id'].numpy(),
        'movie_id': record['movie_id'].numpy(),
        'rating': record['rating'].numpy(),
        'age': record['age'].numpy(),
        'occupation': record['occupation'].numpy(),
        'genres': record['genres'].numpy().decode('utf-8'),
        'gender': record['gender'].numpy().decode('utf-8'),
    }
    data_list.append(row)

# Create the Polars DataFrame from the list of dictionaries
df = pl.from_records(data_list)

gender_enum = pl.Enum(['M', 'F'])
genres_enum = pl.Enum(["Action", "Adventure", "Animation", "Children", "Comedy",
          "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir",
          "Horror", "Musical", "Mystery", "Romance", "Sci-Fi",
          "Thriller", "War", "Western"])

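#enums have to be strings
#age_enum = pl.Enum([1, 18, 25, 35, 45, 50, 56])
#df['age'].dtype = age_enum

# cast() returns a new Series, so write the result back to the DataFrame
df = df.with_columns(pl.col("gender").cast(gender_enum))

# after splits, can use:
#df['genres'].cast(genres_enum)

# Only the last expression of a cell is displayed, so print the summaries explicitly
print(df.null_count())
print(df.describe())
df.head()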
#"""
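
Because a Polars `Enum` only accepts string categories, the integer age codes need a string form before they can be cast. A minimal sketch, assuming the age groups documented in the MovieLens 1M README; the names `AGE_LABELS` and `age_codes_to_labels` are hypothetical and not part of the notebook:

```python
# Age codes and their ranges as listed in the MovieLens 1M README.
AGE_LABELS = {1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
              45: "45-49", 50: "50-55", 56: "56+"}

def age_codes_to_labels(codes):
    """Map integer age codes to string labels; unknown codes raise KeyError."""
    return [AGE_LABELS[c] for c in codes]

labels = age_codes_to_labels([1, 56, 25])
print(labels)  # ['Under 18', '56+', '25-34']
```

The resulting strings could then be cast with `pl.Enum(list(AGE_LABELS.values()))`, following the same pattern as `gender_enum`.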

### 4.c. EDA on transformed train, eval, test

In [None]:
#/kaggle/working/bin/transform_1/test_MovieLensExampleGen/TestPythonTransformPipeline/
    #   MovieLensExampleGen/output_examples/1/
    #   Split-<train, eval, or test>/data_tfrecord-0000?-of-00004.tfrecord

# "/kaggle/working/bin/transform_1/test_MovieLensExampleGen/TestPythonTransformPipeline/
    #   Transform/transformed_examples/4"
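
The shard files under a `transformed_examples` directory can also be collected per split from Python. The sketch below is an assumption about how one might do this: `shards_by_split` is a hypothetical helper, and the demo builds a throwaway directory tree that only mimics the `Split-<name>` layout shown in the paths above.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

def shards_by_split(root, pattern="transformed_examples-*.gz"):
    """Group shard files under root by their Split-<name> parent directory."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob(pattern)):
        parent = path.parent.name
        if parent.startswith("Split-"):
            groups[parent[len("Split-"):]].append(str(path))
    return dict(groups)

# Demo on a temporary tree with empty placeholder files (not real TFRecords).
with tempfile.TemporaryDirectory() as tmp:
    for split in ("train", "eval"):
        split_dir = Path(tmp) / "transformed_examples" / "4" / f"Split-{split}"
        split_dir.mkdir(parents=True)
        for i in range(2):
            (split_dir / f"transformed_examples-0000{i}-of-00002.gz").touch()
    demo = shards_by_split(tmp)
    print({k: len(v) for k, v in demo.items()})
```

Each list of real shard paths could then be passed to `tf.data.TFRecordDataset(files, compression_type="GZIP")`, since the `.gz` shards are gzip-compressed TFRecord files.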

In [33]:
print(sns.__version__)
!pip show seaborn

0.12.2
Name: seaborn
Version: 0.13.2
Summary: Statistical data visualization
Home-page: 
Author: 
Author-email: Michael Waskom <mwaskom@gmail.com>
License: 
Location: /usr/local/lib/python3.13/site-packages
Requires: matplotlib, numpy, pandas
Required-by: 


In [42]:
#!find /kaggle/working/bin -type d -name "transformed_examples" | xargs ls -lR
!find /kaggle/working/bin -type f -iname "transformed_examples*.gz"

/kaggle/working/bin/transform_1/test_MovieLensExampleGen/TestPythonTransformPipeline/Transform/transformed_examples/4/Split-train/transformed_examples-00000-of-00004.gz
/kaggle/working/bin/transform_1/test_MovieLensExampleGen/TestPythonTransformPipeline/Transform/transformed_examples/4/Split-train/transformed_examples-00002-of-00004.gz
/kaggle/working/bin/transform_1/test_MovieLensExampleGen/TestPythonTransformPipeline/Transform/transformed_examples/4/Split-train/transformed_examples-00001-of-00004.gz
/kaggle/working/bin/transform_1/test_MovieLensExampleGen/TestPythonTransformPipeline/Transform/transformed_examples/4/Split-train/transformed_examples-00003-of-00004.gz
/kaggle/working/bin/transform_1/test_MovieLensExampleGen/TestPythonTransformPipeline/Transform/transformed_examples/4/Split-eval/transformed_examples-00000-of-00004.gz
/kaggle/working/bin/transform_1/test_MovieLensExampleGen/TestPythonTransformPipeline/Transform/transformed_examples/4/Split-eval/transformed_examples-00002-