# Crafting Custom Logic with MLflow’s PythonModel

mlflow.pyfunc
The python_function model flavor serves as a default model interface for MLflow Python models. Any MLflow Python model is expected to be loadable as a python_function model.

In addition, the mlflow.pyfunc module defines a generic filesystem format for Python models and provides utilities for saving to and loading from this format. The format is self contained in the sense that it includes all necessary information for anyone to load it and use it. Dependencies are either stored directly with the model or referenced via a Conda environment.

The mlflow.pyfunc module also defines utilities for creating custom pyfunc models using frameworks and inference logic that may not be natively included in MLflow


custom implementation for semantic search using MLflow and Sentence Transformers

# pyfunc

https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#pyfunc-create-custom

https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.PyFuncModel

# CloudPickle
https://github.com/cloudpipe/cloudpickle

cloudpickle can serialize Python functions, lambda functions, and locally defined classes and functions inside other functions. This makes cloudpickle especially useful for parallel and distributed computing where code objects need to be sent over network to execute on remote workers


# Function-based Model vs Class-based Model
When creating custom PyFunc models, you can choose between two different interfaces: a function-based model and a class-based model. In short, a function-based model is simply a python function that does not take additional params. The class-based model, on the other hand, is subclass of PythonModel that supports several required and optional methods. If your use case is simple and fits within a single predict function, a funcion-based approach is recommended. If you need more power, such as custom serialization, custom data processing, or to override additional methods, you should use the class-based implementation.

In [3]:

%pip install mlflow==2.11.2 -q
%pip install transformers==4.39.3  pyngrok datasets sentence_transformers -q

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [4]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
NGROK = user_secrets.get_secret("NGROK")

In [5]:

from pyngrok import ngrok

get_ipython().system_raw("mlflow ui --port 5000 &")


# Terminate open tunnels if exist
ngrok.kill()

* 'schema_extra' has been renamed to 'json_schema_extra'
[2024-05-18 20:14:25 +0000] [139] [INFO] Starting gunicorn 21.2.0
[2024-05-18 20:14:25 +0000] [139] [INFO] Listening at: http://127.0.0.1:5000 (139)
[2024-05-18 20:14:25 +0000] [139] [INFO] Using worker: sync
[2024-05-18 20:14:25 +0000] [140] [INFO] Booting worker with pid: 140
[2024-05-18 20:14:25 +0000] [141] [INFO] Booting worker with pid: 141
[2024-05-18 20:14:25 +0000] [142] [INFO] Booting worker with pid: 142
[2024-05-18 20:14:25 +0000] [143] [INFO] Booting worker with pid: 143


In [6]:
ngrok.set_auth_token(NGROK)

# Open an HTTPs tunnel on port 5000 for http://localhost:5000
ngrok_tunnel = ngrok.connect(addr="5000", proto="http", bind_tls=True)
print("MLflow Tracking UI:", ngrok_tunnel.public_url)

MLflow Tracking UI: https://6e48-34-168-103-30.ngrok-free.app                                       


In [10]:
import warnings

# Disable a few less-than-useful UserWarnings from setuptools and pydantic
warnings.filterwarnings("ignore", category=UserWarning)

In [9]:
import warnings

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer, util

import mlflow
from mlflow.models.signature import infer_signature
from mlflow.pyfunc import PythonModel

# CloudPickle
https://github.com/cloudpipe/cloudpickle

cloudpickle can serialize Python functions, lambda functions, and locally defined classes and functions inside other functions. This makes cloudpickle especially useful for parallel and distributed computing where code objects need to be sent over network to execute on remote workers

# https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.log_model

# Function-based Model

If you’re looking to serialize a simple python function without additional dependent methods, you can simply log a predict method via the keyword argument python_model

```
import mlflow
import pandas as pd

# Define a simple function to log
def predict(model_input):
    return model_input.apply(lambda x: x * 2)

# Save the function as a model
with mlflow.start_run():
    mlflow.pyfunc.log_model("model", python_model=predict, pip_requirements=["pandas"])
    run_id = mlflow.active_run().info.run_id

# Load the model from the tracking server and perform inference
model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
x_new = pd.Series([1,2,3])

prediction = model.predict(x_new)
print(prediction)
```

# Class-based Model
If you’re looking to serialize a more complex object, for instance a class that handles preprocessing, complex prediction logic, or custom serialization, you should subclass the PythonModel class.

```
import mlflow
import pandas as pd

class MyModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input, params=None):
        return [x*2 for x in model_input]

# Save the function as a model
with mlflow.start_run():
    mlflow.pyfunc.log_model("model", python_model=MyModel(), pip_requirements=["pandas"])
    run_id = mlflow.active_run().info.run_id

# Load the model from the tracking server and perform inference
model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
x_new = pd.Series([1, 2, 3])

print(f"Prediction:
    {model.predict(x_new)}")
```

In [9]:
# Python function models are loaded as an instance of PyFuncModel, which is an MLflow wrapper around 
# the model implementation and model metadata (MLmodel file)
# https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.PyFuncModel
help(PythonModel)

Help on class PythonModel in module mlflow.pyfunc.model:

class PythonModel(builtins.object)
 |  Represents a generic Python model that evaluates inputs and produces API-compatible outputs.
 |  By subclassing :class:`~PythonModel`, users can create customized MLflow models with the
 |  "python_function" ("pyfunc") flavor, leveraging custom inference logic and artifact
 |  dependencies.
 |  
 |  Methods defined here:
 |  
 |  load_context(self, context)
 |      Loads artifacts from the specified :class:`~PythonModelContext` that can be used by
 |      :func:`~PythonModel.predict` when evaluating inputs. When loading an MLflow model with
 |      :func:`~load_model`, this method is called as soon as the :class:`~PythonModel` is
 |      constructed.
 |      
 |      The same :class:`~PythonModelContext` will also be available during calls to
 |      :func:`~PythonModel.predict`, but it may be more efficient to override this method
 |      and load artifacts from the context at model load t

# Advanced Semantic Search with Sentence Transformers and MLflow

In [11]:
class SemanticSearchModel(PythonModel):
    def load_context(self, context):
        """Load the model context for inference, including the corpus from a file."""
        try:
            # Load the pre-trained sentence transformer model
            self.model = SentenceTransformer.load(context.artifacts["model_path"])

            # Load the corpus from the specified file
            corpus_file = context.artifacts["corpus_file"]
            with open(corpus_file) as file:
                self.corpus = file.read().splitlines()

            # Encode the corpus and convert it to a tensor
            self.corpus_embeddings = self.model.encode(self.corpus, convert_to_tensor=True)

        except Exception as e:
            raise ValueError(f"Error loading model and corpus: {e}")

    def predict(self, context, model_input, params=None):
        """Predict method to perform semantic search over the corpus."""

        if isinstance(model_input, pd.DataFrame):
            if model_input.shape[1] != 1:
                raise ValueError("DataFrame input must have exactly one column.")
            model_input = model_input.iloc[0, 0]
        elif isinstance(model_input, dict):
            model_input = model_input.get("sentence")
            if model_input is None:
                raise ValueError("The input dictionary must have a key named 'sentence'.")
        else:
            raise TypeError(
                f"Unexpected type for model_input: {type(model_input)}. Must be either a Dict or a DataFrame."
            )

        # Encode the query
        query_embedding = self.model.encode(model_input, convert_to_tensor=True)

        # Compute cosine similarity scores
        cos_scores = util.cos_sim(query_embedding, self.corpus_embeddings)[0]

        # Determine the number of top results to return
        top_k = params.get("top_k", 3) if params else 3  # Default to 3 if not specified

        minimum_relevancy = (
            params.get("minimum_relevancy", 0.2) if params else 0.2
        )  # Default to 0.2 if not specified

        # Get the top_k most similar sentences from the corpus
        top_results = np.argsort(cos_scores, axis=0)[-top_k:]

        # Prepare the initial results list
        initial_results = [
            (self.corpus[idx], cos_scores[idx].item()) for idx in reversed(top_results)
        ]

        # Filter the results based on the minimum relevancy threshold
        filtered_results = [result for result in initial_results if result[1] >= minimum_relevancy]

        # If all results are below the threshold, issue a warning and return the top result
        if not filtered_results:
            warnings.warn(
                "All top results are below the minimum relevancy threshold. "
                "Returning the highest match instead.",
                RuntimeWarning,
            )
            return [initial_results[0]]
        else:
            return filtered_results

In [12]:
corpus = [
    "Perfecting a Sourdough Bread Recipe: The Joy of Baking. Baking sourdough bread "
    "requires patience, skill, and a good understanding of yeast fermentation. Each "
    "loaf is unique, telling its own story of the baker's journey.",
    "The Mars Rover's Discoveries: Unveiling the Red Planet. NASA's Mars rover has "
    "sent back stunning images and data, revealing the planet's secrets. These "
    "discoveries may hold the key to understanding Mars' history.",
    "The Art of Growing Herbs: Enhancing Your Culinary Skills. Growing your own "
    "herbs can transform your cooking, adding fresh and vibrant flavors. Whether it's "
    "basil, thyme, or rosemary, each herb has its own unique characteristics.",
    "AI in Software Development: Transforming the Tech Landscape. The rapid "
    "advancements in artificial intelligence are reshaping how we approach software "
    "development. From automation to machine learning, the possibilities are endless.",
    "Backpacking Through Europe: A Journey of Discovery. Traveling across Europe by "
    "backpack allows one to immerse in diverse cultures and landscapes. It's an "
    "adventure that combines the thrill of exploration with personal growth.",
    "Shakespeare's Timeless Influence: Reshaping Modern Storytelling. The works of "
    "William Shakespeare continue to inspire and influence contemporary literature. "
    "His mastery of language and deep understanding of human nature are unparalleled.",
    "The Rise of Renewable Energy: A Sustainable Future. Embracing renewable energy "
    "is crucial for achieving a sustainable and environmentally friendly lifestyle. "
    "Solar, wind, and hydro power are leading the way in this green revolution.",
    "The Magic of Jazz: An Exploration of Sound and Harmony. Jazz music, known for "
    "its improvisation and complex harmonies, has a rich and diverse history. It "
    "evokes a range of emotions, often reflecting the soul of the musician.",
    "Yoga for Mind and Body: The Benefits of Regular Practice. Engaging in regular "
    "yoga practice can significantly improve flexibility, strength, and mental "
    "well-being. It's a holistic approach to health, combining physical and spiritual "
    "aspects.",
    "The Egyptian Pyramids: Monuments of Ancient Majesty. The ancient Egyptian "
    "pyramids, monumental tombs for pharaohs, are marvels of architectural "
    "ingenuity. They stand as a testament to the advanced skills of ancient builders.",
    "Vegan Cuisine: A World of Flavor. Exploring vegan cuisine reveals a world of "
    "nutritious and delicious possibilities. From hearty soups to delectable desserts, "
    "plant-based dishes are diverse and satisfying.",
    "Extraterrestrial Life: The Endless Search. The quest to find life beyond Earth "
    "continues to captivate scientists and the public alike. Advances in space "
    "technology are bringing us closer to answering this age-old question.",
    "The Art of Plant Pruning: Promoting Healthy Growth. Regular pruning is essential "
    "for maintaining healthy and vibrant plants. It's not just about cutting back, but "
    "understanding each plant's growth patterns and needs.",
    "Cybersecurity in the Digital Age: Protecting Our Data. With the rise of digital "
    "technology, cybersecurity has become a critical concern. Protecting sensitive "
    "information from cyber threats is an ongoing challenge for individuals and "
    "businesses alike.",
    "The Great Wall of China: A Historical Journey. Visiting the Great Wall offers "
    "more than just breathtaking views; it's a journey through history. This ancient "
    "structure tells stories of empires, invasions, and human resilience.",
    "Mystery Novels: Crafting Suspense and Intrigue. A great mystery novel captivates "
    "the reader with intricate plots and unexpected twists. It's a genre that combines "
    "intellectual challenge with entertainment.",
    "Conserving Endangered Species: A Global Effort. Protecting endangered species "
    "is a critical task that requires international collaboration. From rainforests to "
    "oceans, every effort counts in preserving our planet's biodiversity.",
    "Emotions in Classical Music: A Symphony of Feelings. Classical music is not just "
    "an auditory experience; it's an emotional journey. Each composition tells a story, "
    "conveying feelings from joy to sorrow, tranquility to excitement.",
    "CrossFit: A Test of Strength and Endurance. CrossFit is more than just a fitness "
    "regimen; it's a lifestyle that challenges your physical and mental limits. It "
    "combines various disciplines to create a comprehensive workout.",
    "The Renaissance: An Era of Artistic Genius. The Renaissance marked a period of "
    "extraordinary artistic and scientific achievements. It was a time when creativity "
    "and innovation flourished, reshaping the course of history.",
    "Exploring International Cuisines: A Culinary Adventure. Discovering international "
    "cuisines is an adventure for the palate. Each dish offers a glimpse into the "
    "culture and traditions of its origin.",
    "Astronaut Training: Preparing for the Unknown. Becoming an astronaut involves "
    "rigorous training to prepare for the extreme conditions of space. It's a journey "
    "that tests both physical endurance and mental resilience.",
    "Sustainable Gardening: Nurturing the Environment. Sustainable gardening is not "
    "just about growing plants; it's about cultivating an ecosystem. By embracing "
    "environmentally friendly practices, gardeners can have a positive impact on the "
    "planet.",
    "The Smartphone Revolution: Changing Communication. Smartphones have transformed "
    "how we communicate, offering unprecedented connectivity and convenience. This "
    "technology continues to evolve, shaping our daily interactions.",
    "Experiencing African Safaris: Wildlife and Wilderness. An African safari is an "
    "unforgettable experience that brings you face-to-face with the wonders of "
    "wildlife. It's a journey that connects you with the raw beauty of nature.",
    "Graphic Novels: A Blend of Art and Story. Graphic novels offer a unique medium "
    "where art and narrative intertwine to tell compelling stories. They challenge "
    "traditional forms of storytelling, offering visual and textual richness.",
    "Addressing Ocean Pollution: A Call to Action. The increasing levels of pollution "
    "in our oceans are a pressing environmental concern. Protecting marine life and "
    "ecosystems requires concerted global efforts.",
    "The Origins of Hip Hop: A Cultural Movement. Hip hop music, originating from the "
    "streets of New York, has grown into a powerful cultural movement. Its beats and "
    "lyrics reflect the experiences and voices of a community.",
    "Swimming: A Comprehensive Workout. Swimming offers a full-body workout that is "
    "both challenging and refreshing. It's an exercise that enhances cardiovascular "
    "health, builds muscle, and improves endurance.",
    "The Fall of the Berlin Wall: A Historical Turning Point. The fall of the Berlin "
    "Wall was not just a physical demolition; it was a symbol of political and social "
    "change. This historic event marked the end of an era and the beginning of a new "
    "chapter in world history.",
]

# Write the corpus to a file
corpus_file = "/kaggle/working/search_corpus.txt"
with open(corpus_file, "w") as file:
    for sentence in corpus:
        file.write(sentence + "\n")

In [13]:
# Load a pre-trained sentence transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Create an input example DataFrame
input_example = ["Something I want to find matches for."]

# Save the model in the /tmp directory
model_directory = "/kaggle/working/mlruns/search_model"
model.save(model_directory)

artifacts = {"model_path": model_directory, "corpus_file": corpus_file}

# Generate test output for signature
test_output = ["match 1", "match 2", "match 3"]

# Define the signature associated with the model
signature = infer_signature(
    input_example, test_output, params={"top_k": 3, "minimum_relevancy": 0.2}
)

# Visualize the signature
signature

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

inputs: 
  [string (required)]
outputs: 
  [string (required)]
params: 
  ['top_k': long (default: 3), 'minimum_relevancy': double (default: 0.2)]

In [14]:
mlflow.set_tracking_uri("http://127.0.0.1:5000")

mlflow.set_experiment("Semantic Similarity")

2024/05/18 20:25:10 INFO mlflow.tracking.fluent: Experiment with name 'Semantic Similarity' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/818146615553542844', creation_time=1716063910152, experiment_id='818146615553542844', last_update_time=1716063910152, lifecycle_stage='active', name='Semantic Similarity', tags={}>

In [16]:
from datetime import datetime
import pandas as pd
name = "Semantic_Similarity_" +datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
with mlflow.start_run(run_name = name) as run:
    model_info = mlflow.pyfunc.log_model(
        "semantic_search",
        python_model=SemanticSearchModel(),
        input_example=input_example,
        signature=signature,
        artifacts=artifacts,
        pip_requirements=["sentence_transformers", "numpy"],
    )

2024/05/18 20:26:33 INFO mlflow.models.utils: Lists of scalar values are not converted to a pandas DataFrame. If you expect to use pandas DataFrames for inference, please construct a DataFrame and pass it to input_example instead.


Downloading artifacts:   0%|          | 0/11 [00:00<?, ?it/s]

2024/05/18 20:26:33 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false


Downloading artifacts:   0%|          | 0/1 [00:00<?, ?it/s]

In [17]:
model_info.model_uri

'runs:/aa71ba59137540b29a58738ca05d6aa5/semantic_search'

In [18]:
loaded_dynamic = mlflow.pyfunc.load_model(model_info.model_uri)

# Make sure that it generates a reasonable output
loaded_dynamic.predict(["I'd like some ideas for a meal to cook."], params={"top_k": 4, "minimum_relevancy": 0.25})

Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 797.85it/s] 
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 2813.08it/s] 
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 4691.62it/s] 
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 3004.52it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 871.45it/s] 
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 1778.75it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 670.55it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 1522.99it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 3106.89it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 700.92it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 2646.25it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 2056.03it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00,  6.27it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 2636.27it/s]
Download

Downloading artifacts:   0%|          | 0/35 [00:00<?, ?it/s]

Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 373.29it/s]
2024/05/18 20:27:52 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 838.69it/s] 
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 1440.85it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 269.49it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 741.96it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 1055.44it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00,  6.47it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 395.91it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 521.36it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 3189.58it/s] 
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 3426.72it/s] 
Downloading artifacts: 100%|██████████| 1/1 [00

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[('Exploring International Cuisines: A Culinary Adventure. Discovering international cuisines is an adventure for the palate. Each dish offers a glimpse into the culture and traditions of its origin.',
  0.43857109546661377),
 ('Vegan Cuisine: A World of Flavor. Exploring vegan cuisine reveals a world of nutritious and delicious possibilities. From hearty soups to delectable desserts, plant-based dishes are diverse and satisfying.',
  0.3468847870826721)]

In [19]:
loaded_dynamic.predict(
    ["Latest stories on computing"], params={"top_k": 10, "minimum_relevancy": 0.4}
)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]



[('AI in Software Development: Transforming the Tech Landscape. The rapid advancements in artificial intelligence are reshaping how we approach software development. From automation to machine learning, the possibilities are endless.',
  0.2533860206604004)]

# Advanced Paraphrase Mining with Sentence Transformers and MLflow


In [20]:
import warnings
from typing import List

class ParaphraseMiningModel(PythonModel):
    def load_context(self, context):
        """Load the model context for inference, including the customer feedback corpus."""
        try:
            # Load the pre-trained sentence transformer model
            self.model = SentenceTransformer.load(context.artifacts["model_path"])

            # Load the customer feedback corpus from the specified file
            corpus_file = context.artifacts["corpus_file"]
            with open(corpus_file) as file:
                self.corpus = file.read().splitlines()

        except Exception as e:
            raise ValueError(f"Error loading model and corpus: {e}")

    def _sort_and_filter_matches(
        self, query: str, paraphrase_pairs: List[tuple], similarity_threshold: float
    ):
        """Sort and filter the matches by similarity score."""

        # Convert to list of tuples and sort by score
        sorted_matches = sorted(paraphrase_pairs, key=lambda x: x[1], reverse=True)

        # Filter and collect paraphrases for the query, avoiding duplicates
        query_paraphrases = {}
        for score, i, j in sorted_matches:
            if score < similarity_threshold:
                continue

            paraphrase = self.corpus[j] if self.corpus[i] == query else self.corpus[i]
            if paraphrase == query:
                continue

            if paraphrase not in query_paraphrases or score > query_paraphrases[paraphrase]:
                query_paraphrases[paraphrase] = score

        return sorted(query_paraphrases.items(), key=lambda x: x[1], reverse=True)

    def predict(self, context, model_input, params=None):
        """Predict method to perform paraphrase mining over the corpus."""

        # Validate and extract the query input
        if isinstance(model_input, pd.DataFrame):
            if model_input.shape[1] != 1:
                raise ValueError("DataFrame input must have exactly one column.")
            query = model_input.iloc[0, 0]
        elif isinstance(model_input, dict):
            query = model_input.get("query")
            if query is None:
                raise ValueError("The input dictionary must have a key named 'query'.")
        else:
            raise TypeError(
                f"Unexpected type for model_input: {type(model_input)}. Must be either a Dict or a DataFrame."
            )

        # Determine the minimum similarity threshold
        similarity_threshold = params.get("similarity_threshold", 0.5) if params else 0.5

        # Add the query to the corpus for paraphrase mining
        extended_corpus = self.corpus + [query]

        # Perform paraphrase mining
        paraphrase_pairs = util.paraphrase_mining(
            self.model, extended_corpus, show_progress_bar=False
        )

        # Convert to list of tuples and sort by score
        sorted_paraphrases = self._sort_and_filter_matches(
            query, paraphrase_pairs, similarity_threshold
        )

        # Warning if no paraphrases found
        if not sorted_paraphrases:
            warnings.warn("No paraphrases found above the similarity threshold.", UserWarning)

        return {sentence[0]: str(sentence[1]) for sentence in sorted_paraphrases}

In [21]:
corpus = [
    "Exploring ancient cities in Europe offers a glimpse into history.",
    "Modern AI technologies are revolutionizing industries.",
    "Healthy eating contributes significantly to overall well-being.",
    "Advancements in renewable energy are combating climate change.",
    "Learning a new language opens doors to different cultures.",
    "Gardening is a relaxing hobby that connects you with nature.",
    "Blockchain technology could redefine digital transactions.",
    "Homemade Italian pasta is a delight to cook and eat.",
    "Practicing yoga daily improves both physical and mental health.",
    "The art of photography captures moments in time.",
    "Baking bread at home has become a popular quarantine activity.",
    "Virtual reality is creating new experiences in gaming.",
    "Sustainable travel is becoming a priority for eco-conscious tourists.",
    "Reading books is a great way to unwind and learn.",
    "Jazz music provides a rich tapestry of sound and rhythm.",
    "Marathon training requires discipline and perseverance.",
    "Studying the stars helps us understand our universe.",
    "The rise of electric cars is an important environmental development.",
    "Documentary films offer deep insights into real-world issues.",
    "Crafting DIY projects can be both fun and rewarding.",
    "The history of ancient civilizations is fascinating to explore.",
    "Exploring the depths of the ocean reveals a world of marine wonders.",
    "Learning to play a musical instrument can be a rewarding challenge.",
    "Artificial intelligence is shaping the future of personalized medicine.",
    "Cycling is not only a great workout but also eco-friendly transportation.",
    "Home automation with IoT devices is enhancing living experiences.",
    "Understanding quantum computing requires a grasp of complex physics.",
    "A well-brewed cup of coffee is the perfect start to the day.",
    "Urban farming is gaining popularity as a sustainable food source.",
    "Meditation and mindfulness can lead to a more balanced life.",
    "The popularity of podcasts has revolutionized audio storytelling.",
    "Space exploration continues to push the boundaries of human knowledge.",
    "Wildlife conservation is essential for maintaining biodiversity.",
    "The fusion of technology and fashion is creating new trends.",
    "E-learning platforms have transformed the educational landscape.",
    "Dark chocolate has surprising health benefits when enjoyed in moderation.",
    "Robotics in manufacturing is leading to more efficient production.",
    "Creating a personal budget is key to financial well-being.",
    "Hiking in nature is a great way to connect with the outdoors.",
    "3D printing is innovating the way we create and manufacture objects.",
    "Sommeliers can identify a wine's characteristics with just a taste.",
    "Mind-bending puzzles and riddles are great for cognitive exercise.",
    "Social media has a profound impact on communication and culture.",
    "Urban sketching captures the essence of city life on paper.",
    "The ethics of AI is a growing field in tech philosophy.",
    "Homemade skincare remedies are becoming more popular.",
    "Virtual travel experiences can provide a sense of adventure at home.",
    "Ancient mythology still influences modern storytelling and literature.",
    "Building model kits is a hobby that requires patience and precision.",
    "The study of languages opens windows into different worldviews.",
    "Professional esports has become a major global phenomenon.",
    "The mysteries of the universe are unveiled through space missions.",
    "Astronauts' experiences in space stations offer unique insights into life beyond Earth.",
    "Telescopic observations bring distant galaxies within our view.",
    "The study of celestial bodies helps us understand the cosmos.",
    "Space travel advancements could lead to interplanetary exploration.",
    "Observing celestial events provides valuable data for astronomers.",
    "The development of powerful rockets is key to deep space exploration.",
    "Mars rover missions are crucial in searching for extraterrestrial life.",
    "Satellites play a vital role in our understanding of Earth's atmosphere.",
    "Astrophysics is central to unraveling the secrets of space.",
    "Zero gravity environments in space pose unique challenges and opportunities.",
    "Space tourism might soon become a reality for many.",
    "Lunar missions have contributed significantly to our knowledge of the moon.",
    "The International Space Station is a hub for groundbreaking space research.",
    "Studying comets and asteroids reveals information about the early solar system.",
    "Advancements in space technology have implications for many scientific fields.",
    "The possibility of life on other planets continues to intrigue scientists.",
    "Black holes are among the most mysterious phenomena in space.",
    "The history of space exploration is filled with remarkable achievements.",
    "Future space missions could unlock the mysteries of dark matter.",
]

# Write out the corpus to a file
corpus_file = "/kaggle/working/feedback.txt"
with open(corpus_file, "w") as file:
    for sentence in corpus:
        file.write(sentence + "\n")

In [23]:
# Load a pre-trained sentence transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Create an input example DataFrame
input_example = pd.DataFrame({"query": ["This product works well. I'm satisfied."]})

# Save the model in the /tmp directory
model_directory = "/kaggle/working/paraphrase_search_model"
model.save(model_directory)

# Define the path for the corpus file
corpus_file = "/kaggle/working/feedback.txt"

# Define the artifacts (paths to the model and corpus file)
artifacts = {"model_path": model_directory, "corpus_file": corpus_file}

# Generate test output for signature
# Sample output for paraphrase mining could be a list of tuples (paraphrase, score)
test_output = [{"This product is satisfactory and functions as expected.": "0.8"}]

# Define the signature associated with the model
# The signature includes the structure of the input and the expected output, as well as any parameters that
# we would like to expose for overriding at inference time (including their default values if they are not overridden).
signature = infer_signature(
    model_input=input_example, model_output=test_output, params={"similarity_threshold": 0.5}
)

# Visualize the signature, showing our overridden inference parameter and its default.
signature

inputs: 
  ['query': string (required)]
outputs: 
  ['This product is satisfactory and functions as expected.': string (required)]
params: 
  ['similarity_threshold': double (default: 0.5)]

In [24]:
# Set experiment
mlflow.set_experiment("Paraphrase Mining")

2024/05/18 20:31:09 INFO mlflow.tracking.fluent: Experiment with name 'Paraphrase Mining' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/950618281688856821', creation_time=1716064269369, experiment_id='950618281688856821', last_update_time=1716064269369, lifecycle_stage='active', name='Paraphrase Mining', tags={}>

In [25]:
name = "Paraphrase_Mining_" +datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
with mlflow.start_run(run_name = name) as run:
    model_info = mlflow.pyfunc.log_model(
        "paraphrase_model",
        python_model=ParaphraseMiningModel(),
        input_example=input_example,
        signature=signature,
        artifacts=artifacts,
        pip_requirements=["sentence_transformers"],
    )

Downloading artifacts:   0%|          | 0/11 [00:00<?, ?it/s]

2024/05/18 20:31:13 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false


Downloading artifacts:   0%|          | 0/1 [00:00<?, ?it/s]

In [26]:
loaded_dynamic = mlflow.pyfunc.load_model(model_info.model_uri)

# Perform a quick validation that our loaded model is performing adequately
loaded_dynamic.predict(
    {"query": "Space exploration is fascinating."}, params={"similarity_threshold": 0.65}
)

Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 3669.56it/s] 
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 986.20it/s] 
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 1019.02it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 1201.46it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 1224.61it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 1068.34it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 2822.55it/s] 
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 3137.10it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 1151.65it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 1990.65it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 745.12it/s] 
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 3113.81it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 946.58it/s] 
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 1246.82it/s]
Down

Downloading artifacts:   0%|          | 0/35 [00:00<?, ?it/s]

2024/05/18 20:32:07 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 1276.80it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 1038.19it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 942.96it/s] 
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 394.28it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 942.75it/s] 
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 591.16it/s] 
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00,  7.11it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 356.69it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 1112.55it/s]
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 930.62it/s] 
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 5023.12it/s] 
Downloading artifacts: 100%|██████████| 1/1 [

{'Studying the stars helps us understand our universe.': '0.8207423686981201',
 'The history of space exploration is filled with remarkable achievements.': '0.77706378698349',
 'Exploring ancient cities in Europe offers a glimpse into history.': '0.7461956739425659',
 'Space travel advancements could lead to interplanetary exploration.': '0.7090303897857666',
 'Space exploration continues to push the boundaries of human knowledge.': '0.6893945932388306',
 'The mysteries of the universe are unveiled through space missions.': '0.6830741167068481',
 'The study of celestial bodies helps us understand the cosmos.': '0.6713583469390869'}