# DSC670 - Week 9 - GAI Production Deployment - MLflow Observability Example

Recreate the MLflow example outlined in the text and shown in this code:

https://github.com/bahree/GenAIBook/blob/main/chapters/ch11/Listing-11.4_mlflow_observability.py

This shows how to run the model in a very simple "app" with MLflow monitoring running.



Install some packages to make this work, including:

- mlflow
- tiktoken (optional, the author has a simple token counter)
- prometheus-client

## mlflow server

Run one of the following commands to start the MLflow server:

mlflow server --host 0.0.0.0 --port 5000 --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./artifacts

or 

mlflow server --host 0.0.0.0 --port 5000

## Access the MLflow UI at

http://127.0.0.1:5000
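
Before running the notebook cells, it can help to confirm that the server is actually reachable. The following is a minimal sketch using only the Python standard library. It assumes the tracking server answers a `/health` request at the address above; the helper name `mlflow_server_is_up` is hypothetical and not part of MLflow.

```python
import urllib.request


def mlflow_server_is_up(base_url: str = "http://127.0.0.1:5000", timeout: float = 2.0) -> bool:
    """Return True if the server answers the health check with HTTP 200."""
    try:
        # Assumption: the server exposes a /health endpoint at base_url.
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, DNS failures, and HTTP error statuses.
        return False
```

Calling `mlflow_server_is_up()` with no arguments checks the default address; a `False` result usually means the `mlflow server` command has not been started yet or is listening on a different port.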

## What is MLflow?
MLflow is an open-source platform designed to help manage the complete machine learning (ML) lifecycle. It provides tools that make it easier for data scientists and engineers to track experiments, package code, and deploy models. The main goal of MLflow is to improve collaboration and organization in machine learning projects by making experiments reproducible and models easier to share.

MLflow includes four main components: Tracking, Projects, Models, and Model Registry. The Tracking component helps record and compare model parameters and results. Projects allow users to organize and package code so others can easily run it. Models define a standard way to manage and deploy trained models. The Model Registry serves as a central place to store, version, and manage models throughout their lifecycle.

Overall, MLflow simplifies the process of developing, testing, and deploying machine learning models, ensuring that teams can work together efficiently and maintain consistency across their workflows.


In [3]:
# Lets you build simple interactive web apps for your ML/LLM projects
#pip install streamlit

In [4]:
# Suppress all warnings
import warnings
warnings.filterwarnings("ignore")

In [5]:
# Install required libraries
# mlflow - Logs experiments, metrics, and model runs
# tiktoken - Token counting for OpenAI models
# colorama - Adds colored output to terminal/Jupyter prints
# python-dotenv - Loads variables from a .env file (imported in the next cell as dotenv)
#pip install openai mlflow tiktoken colorama python-dotenv

In [6]:
# %% [markdown]
# Generative AI with OpenAI + MLflow
# 
# This notebook demonstrates how to:
# - Use the OpenAI API for text generation  
# - Log key metrics and parameters with MLflow  
# - Count tokens using `tiktoken`  
# - Track latency and token usage for each prompt  
# - Create a conversational AI loop

# %%
# Load required libraries
import os
import time

import mlflow
import tiktoken as tk
from colorama import Fore, Style, init
from dotenv import load_dotenv
from openai import OpenAI

# Load variables from the .env file into the environment
load_dotenv()

# %% [markdown]
# ## Configuration and Initialization
# - Set your OpenAI API key (from environment variable)
# - Configure MLflow tracking URI
# - Initialize helper variables

# %%
# Set OpenAI API key and model parameters
API_KEY = os.getenv("OPENAI_API_KEY")
MODEL = "gpt-3.5-turbo"
MLFLOW_URI = "http://localhost:5000"  # URL of the MLflow tracking server that stores the data we log; the experiment name is set separately below.

# TEMPERATURE controls how creative or random the model's responses are.
# Low (0.2–0.4): more focused and predictable answers.
# High (0.7–1.0): more creative and varied responses.
# A value of 0.7 gives a good balance between creativity and accuracy.
TEMPERATURE = 0.7

# TOP_P is part of nucleus sampling, another randomness control.
# TOP_P = 1 means the model considers all possible token choices.
# You can lower it (e.g., 0.8) to make the model focus only on the most likely tokens.
TOP_P = 1

# FREQUENCY_PENALTY reduces how often the model repeats the same phrases.
# 0 means no penalty: the model can repeat itself if it decides to.
# Increasing it (e.g., 0.5) makes the model avoid repetition.
FREQUENCY_PENALTY = 0

# PRESENCE_PENALTY affects how often the model brings up new topics.
# 0 means the model can freely continue discussing the same subject.
# A higher value encourages it to explore new ideas instead of repeating.
PRESENCE_PENALTY = 0

# MAX_TOKENS sets the maximum number of tokens (words + symbols) the model can generate in one response.
# Think of tokens as word pieces: for English, roughly 1 token ≈ 0.75 words.
# 800 means the model's response can be up to about 600 words long.
MAX_TOKENS = 800

# DEBUG controls debugging mode.
# When True, the program prints the MLflow run ID and pauses before logging, which helps with troubleshooting.
# When False, it stays clean and only shows normal output.
DEBUG = False

# Initialize colorama
init()

# Initialize OpenAI client
client = OpenAI(api_key=API_KEY)

# Set MLflow tracking URI and experiment
# MLflow connects to a tracking server running locally on port 5000.
# Make sure you have already started it by running "mlflow server".
mlflow.set_tracking_uri(MLFLOW_URI)
# Create the experiment "GenAI_book" if it does not exist, then make it the active experiment.
mlflow.set_experiment("GenAI_book")


<Experiment: artifact_location='mlflow-artifacts:/135424484768455482', creation_time=1762621961375, experiment_id='135424484768455482', last_update_time=1762621961375, lifecycle_stage='active', name='GenAI_book', tags={'mlflow.experimentKind': 'genai_development'}>
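
The `TOP_P` comments in the configuration cell describe nucleus sampling only in words. The sketch below illustrates the idea on a toy distribution: keep the most likely tokens until their cumulative probability reaches `top_p`, then renormalize. It is an illustration of the concept under that assumption, not OpenAI's implementation; the function `top_p_filter` and the toy probabilities are hypothetical.

```python
from typing import Dict


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, probability in ranked:
        kept.append((token, probability))
        cumulative += probability
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1 again.
    total = sum(probability for _, probability in kept)
    return {token: probability / total for token, probability in kept}


# Toy next-token distribution (made-up numbers for illustration only).
toy_probs = {"the": 0.5, "a": 0.25, "this": 0.125, "zebra": 0.125}

print(top_p_filter(toy_probs, 1.0))  # every token stays a candidate
print(top_p_filter(toy_probs, 0.7))  # only the two most likely tokens survive
```

With `top_p = 1.0` nothing is filtered out, which matches the setting used in this notebook; lowering it removes the unlikely tail before a token is sampled.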

In [7]:

# ## Helper Functions
# - Colored print for user and AI messages  
# - Token counting  
# - Text generation function with MLflow logging  
def print_user_input(text):
    """Display user input in green color."""
    print(f"{Fore.GREEN}You: {Style.RESET_ALL}", text)

def print_ai_output(text):
    """Display AI output in blue color."""
    print(f"{Fore.BLUE}AI Assistant:{Style.RESET_ALL}", text)

In [8]:
# Return the token count; "cl100k_base" is the default encoding.
def count_tokens(string: str, encoding_name="cl100k_base") -> int:
    """Count the number of tokens in a given string."""
    # Use the tiktoken library (tk) to look up a tokenizer (encoding) by name.
    encoding = tk.get_encoding(encoding_name)
    # disallowed_special=() means text such as <|endoftext|> is encoded as ordinary text instead of raising an error.
    encoded_string = encoding.encode(string, disallowed_special=())
    # The token count is the number of elements in the encoded list.
    return len(encoded_string)
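
Since tiktoken is listed as optional, a rough estimate can stand in when it is not installed. The sketch below applies the rule of thumb from the configuration cell (about 0.75 English words per token). The function name `approx_token_count` is hypothetical, it is not the book author's counter, and the result is only an approximation of what a real tokenizer would report.

```python
def approx_token_count(text: str) -> int:
    """Estimate the token count from the word count (about 0.75 words per token)."""
    words = len(text.split())
    # tokens ≈ words / 0.75 = words * 4 / 3, rounded up with integer arithmetic.
    return -(-words * 4 // 3)


print(approx_token_count("What is Veterans Day?"))
```

The same arithmetic explains the `MAX_TOKENS` comment: about 600 words correspond to roughly 800 tokens.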

In [9]:
def generate_text(conversation, max_tokens=100) -> str:
    """Generate text from the OpenAI model and log metrics in MLflow."""
    start_time = time.time()

    # Call the OpenAI Chat Completions API: send the entire conversation to the model and ask for the next message.
    response = client.chat.completions.create(
        model=MODEL,  # e.g., "gpt-3.5-turbo"
        messages=conversation,  # A list of messages (system, user, assistant).
        temperature=TEMPERATURE,  # Controls creativity (higher = more random output).
        max_tokens=max_tokens,  # Limits response length.
        top_p=TOP_P,  # Controls diversity of output (nucleus sampling).
        frequency_penalty=FREQUENCY_PENALTY,  # Reduces repetition of tokens.
        presence_penalty=PRESENCE_PENALTY,  # Encourages new topics in responses.
    )

    # Calculate latency: the response time in seconds, useful for performance tracking.
    latency = time.time() - start_time
    # Extract the actual text (content) from the assistant's message.
    message_response = response.choices[0].message.content

    # Token counting
    prompt_tokens = count_tokens(conversation[-1]['content'])  # Tokens in the user's most recent message.
    conversation_tokens = count_tokens(str(conversation))  # Approximate tokens in the full chat history (counts the list's string form).
    completion_tokens = count_tokens(message_response)  # Tokens generated by the model.

    # Log experiment data to MLflow
    # Get the currently active MLflow run (started by the caller with mlflow.start_run()).
    run = mlflow.active_run()
    if DEBUG:
        # In debug mode, print the MLflow run ID and pause execution (useful for troubleshooting).
        print(f"Run ID: {run.info.run_id}")
        input("Press Enter to continue...")

    # Log metrics: store numeric values such as latency and token usage in MLflow for tracking and analysis.
    mlflow.log_metrics({
        "request_count": 1,
        "request_latency": latency,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "conversation_tokens": conversation_tokens
    })

    # Log parameters: store the static experiment settings so we can compare runs later.
    mlflow.log_params({
        "model": MODEL,
        "temperature": TEMPERATURE,
        "top_p": TOP_P,
        "frequency_penalty": FREQUENCY_PENALTY,
        "presence_penalty": PRESENCE_PENALTY
    })
    
    # Finally, returns the generated text so it can be printed or used elsewhere.
    return message_response 
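
`generate_text` logs the same metric keys on every call without a step, so the turns of one conversation are told apart only by timestamp. One option, sketched below, is to pass the turn number as `step`, which `mlflow.log_metrics` accepts as an optional argument. The helper `log_turn_metrics`, its `log_fn` parameter, and the `total_tokens` metric are hypothetical additions; the logger is injected so the sketch runs without a tracking server.

```python
from typing import Callable, Dict, List, Tuple


def log_turn_metrics(
    log_fn: Callable[..., None],
    turn: int,
    latency: float,
    prompt_tokens: int,
    completion_tokens: int,
) -> Dict[str, float]:
    """Build the per-turn metrics and pass them to log_fn with step=turn."""
    metrics = {
        "request_latency": latency,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
    log_fn(metrics, step=turn)
    return metrics


# Stand-in logger that records calls in memory instead of contacting MLflow.
recorded: List[Tuple[Dict[str, float], int]] = []


def fake_log(metrics: Dict[str, float], step: int) -> None:
    recorded.append((metrics, step))


log_turn_metrics(fake_log, turn=0, latency=0.42, prompt_tokens=5, completion_tokens=20)
log_turn_metrics(fake_log, turn=1, latency=0.37, prompt_tokens=8, completion_tokens=15)
print(recorded)
```

Inside an active run, the call would become `log_turn_metrics(mlflow.log_metrics, turn, latency, prompt_tokens, completion_tokens)`, with `turn` incremented by the caller after each response.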

In [10]:
# ## Run the Conversational Loop
# You can run this cell to interact with the model directly.
# Type 'exit', 'quit', 'q', or 'e' to stop the loop.

# This is the main control loop of the chatbot program, which connects the user, the OpenAI API, and MLflow tracking.

# Enable MLflow automatic logging for every supported library that is installed
# (the output below shows it being enabled for openai and langchain).
# It helps track experiments without manually logging every detail.
mlflow.autolog()


# This creates a new MLflow run — think of it as one experiment session.
# Everything that happens inside this with block (metrics, parameters, logs) is stored in this run.
# When the block ends, MLflow automatically closes and saves the run.
with mlflow.start_run() as run:
    conversation = [
        {"role": "system", "content": "You are a helpful assistant."},
    ]

    # Begin an infinite loop so you can chat continuously with the AI.
    while True:
        # Take user input: wait for the user to type a message.
        user_input = input("User: ")

        # Exit condition: if the user types any of these commands, the loop breaks and the chat ends.
        if user_input.lower() in ["exit", "quit", "q", "e"]:
            break

        # Show the user message in green (using the earlier helper function) to make the display cleaner.
        print_user_input(user_input)

        # Add the user message to the conversation list so the model has full context for its next response.
        conversation.append({"role": "user", "content": user_input})

        # Generate the AI response
        ai_output = generate_text(conversation, MAX_TOKENS)

        # Display the AI response
        print_ai_output(ai_output)

        # Add the AI response to the conversation
        conversation.append({"role": "assistant", "content": ai_output})

2025/11/09 11:55:41 INFO mlflow.tracking.fluent: Autologging successfully enabled for openai.


User:  What is veteran day?


You:  What is veteran day?


2025/11/09 11:56:22 INFO mlflow.tracking.fluent: Autologging successfully enabled for langchain.


AI Assistant: Veterans Day is a federal holiday in the United States that honors military veterans who have served in the United States Armed Forces. It is observed annually on November 11th, the anniversary of the end of World War I. Veterans Day is a time to recognize and thank all those who have served in the military for their sacrifices and contributions to our country's freedom.



KeyboardInterrupt
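
To see how the `conversation` list grows from turn to turn without calling the OpenAI API or stopping the cell with a keyboard interrupt, the loop can be replayed with scripted inputs and a stand-in reply function. This is a sketch only: `run_chat` and `echo_reply` are hypothetical names, and `echo_reply` takes the place of `generate_text`.

```python
from typing import Callable, Dict, Iterable, List

Message = Dict[str, str]


def run_chat(user_inputs: Iterable[str], reply_fn: Callable[[List[Message]], str]) -> List[Message]:
    """Replay scripted user inputs through the same append pattern as the chat loop."""
    conversation: List[Message] = [
        {"role": "system", "content": "You are a helpful assistant."},
    ]
    for user_input in user_inputs:
        # Same exit commands as the interactive loop.
        if user_input.lower() in ["exit", "quit", "q", "e"]:
            break
        conversation.append({"role": "user", "content": user_input})
        # The reply function sees the whole history, including the message just added.
        ai_output = reply_fn(conversation)
        conversation.append({"role": "assistant", "content": ai_output})
    return conversation


def echo_reply(conversation: List[Message]) -> str:
    """Hypothetical stand-in for generate_text: no API call is made."""
    return f"(stub reply to: {conversation[-1]['content']})"


history = run_chat(["Hello", "What is Veterans Day?", "quit", "ignored"], echo_reply)
for message in history:
    print(message["role"], "->", message["content"])
```

Each turn adds two entries (user, then assistant), so the history sent to the model keeps growing; that growth is what the `conversation_tokens` metric tracks.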



In [25]:
# Open the Anaconda Prompt and run one of the commands below to start the MLflow server:
# mlflow server --host 0.0.0.0 --port 5000 --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./artifacts
# mlflow server
# Then open the MLflow UI at:
# http://127.0.0.1:5000
