<img src="https://fsdl.me/logo-720-dark-horizontal">

# Lab 08: Monitoring

## What You Will Learn

- How to add user feedback and model monitoring to a Gradio-based app
- How to analyze this logged information to uncover and debug model issues
- Just how large the gap between benchmark data and data from users can be, and what to do about it

In [None]:
lab_idx = 8


if "bootstrap" not in locals() or bootstrap.run:
    # path management for Python
    pythonpath, = !echo $PYTHONPATH
    if "." not in pythonpath.split(":"):
        pythonpath = ".:" + pythonpath
        %env PYTHONPATH={pythonpath}
        !echo $PYTHONPATH

    # get both Colab and local notebooks into the same state
    !wget --quiet https://fsdl.me/gist-bootstrap -O bootstrap.py
    import bootstrap
    
    %matplotlib inline

    # change into the lab directory
    bootstrap.change_to_lab_dir(lab_idx=lab_idx)

    bootstrap.run = False  # change to True to re-run setup
    
!pwd
%ls

### Follow along with a video walkthrough on YouTube:

In [None]:
from IPython.display import IFrame


IFrame(src="https://fsdl.me/2022-lab-08-video-embed", width="100%", height=720)

# Basic user feedback with `gradio`

On top of the basic health check and event logging
necessary for any distributed system
(provided for our application by
[AWS CloudWatch](https://aws.amazon.com/cloudwatch/),
which collects logs from EC2 and Lambda instances),
ML-powered applications need specialized monitoring solutions.

In particular, we want to give users a way
to report issues or indicate their level of satisfaction
with the model.

The UI-building framework we're using, `gradio`,
comes with user feedback, under the name "flagging".

To see how this works, we first spin up our front end,
pointed at the AWS Lambda backend,
as in
[the previous lab](https://fsdl.me/lab07-colab).

In [None]:
from app_gradio import app


lambda_url = "https://3akxma777p53w57mmdika3sflu0fvazm.lambda-url.us-west-1.on.aws/"

backend = app.PredictorBackend(url=lambda_url)

And adding user feedback collection
is as easy as passing `flagging=True`.

> <small> The `flagging` argument is here being given to
code from the FSDL codebase, `app.make_frontend`,
but you can just pass
`flagging=True` directly
to the `gradio.Interface` class.
In between in our code,
we have a bit of extra logic
so that we can support
multiple different storage backends for logging flagged data.
</small>

Run the cell below to create a frontend
(accessible on a public Gradio URL and inside the notebook)
and observe the new "flagging" buttons underneath the outputs.

In [None]:
frontend = app.make_frontend(fn=backend.run, flagging=True)
frontend.launch(share=True)

Click one of the buttons to trigger flagging.

It doesn't need to be a legitimate issue with the model's outputs.

Instead of just submitting one of the example images,
you might additionally use the image editor
(pencil button on uploaded images)
to crop it.

Flagged data is stored on the server's local filesystem,
by default in the `flagged/` directory
as a `.csv` file:

In [None]:
!ls flagged

We can load the `.csv` with `pandas`,
the Python library for handling tabular data.

In [None]:
from pathlib import Path

import pandas as pd


log_path = Path("flagged") / "log.csv"

flagged_df = None
if log_path.exists():
    flagged_df = pd.read_csv(log_path, quotechar="'")  # quoting can be painful for natural text data
    flagged_df = flagged_df.dropna(subset=["Handwritten Text"])  # drop any flags without an image

flagged_df

Notice that richer data, like images, is stored with references --
here, the names of local files.

This is a common pattern:
binary data doesn't go in the database,
only pointers to binary data.
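
As a minimal sketch of this pattern, using only the standard library: the image bytes go to a file named by their content hash, and the tabular log records only the file name. The function name `store_flag`, the `blobs/` directory, and the column layout are hypothetical choices for illustration, not what `gradio` writes.

```python
import csv
import hashlib
from pathlib import Path


def store_flag(log_dir, image_bytes, output_text, flag):
    """Write the image to disk and append a row that points to it."""
    log_dir = Path(log_dir)
    blob_dir = log_dir / "blobs"
    blob_dir.mkdir(parents=True, exist_ok=True)

    # name the file by its content hash, so repeated uploads share one file
    name = hashlib.sha256(image_bytes).hexdigest()[:16] + ".png"
    (blob_dir / name).write_bytes(image_bytes)

    # the table only stores the pointer (a relative path), never the bytes
    log_path = log_dir / "log.csv"
    is_new = not log_path.exists()
    with open(log_path, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["image", "output", "flag"])
        writer.writerow([f"blobs/{name}", output_text, flag])
    return blob_dir / name
```

Swapping the local directory for an object store changes where the bytes live, but the table still only holds a pointer.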

We can then read the data back to analyze our model.

In [None]:
from IPython.display import display

from text_recognizer.util import read_image_pil


if flagged_df is not None:
    row = flagged_df.iloc[-1]
    print(row["output"])
    display(read_image_pil(Path("flagged") / row["Handwritten Text"]))

We encourage you to play around with the model for a bit,
uploading your own images.

This is an important step in understanding your model
and your domain --
especially when you're familiar with the data types involved.

But even when you are,
we expect you'll quickly find
that you run out of ideas
for different ways to probe your model.

To really learn more about your model,
you'll need some actual users.

In small projects,
these can be other team members who are less enmeshed
in the details of model development and data munging.

But to create something that can appeal to a broader set of users,
you'll want to collect feedback from your potential userbase.

# Debugging production models with `gantry`

Unfortunately, this aspect of model development
is particularly challenging to replicate in
a course setting, especially a MOOC --
where do these users come from?

As part of the 2022 edition of the course, we've
[been running a text recognizer application](https://fsdl-text-recognizer.ngrok.io)
and collecting user feedback on it.

Rather than saving user feedback data locally,
as with the CSV logger above,
we've been sending that data to
[Gantry](https://gantry.io/),
a model monitoring and continual learning tool.

That's because local logging is a very bad idea:
as logs grow, the storage needs and read/write time grow,
which unduly burdens the frontend server.

The `gradio` library supports logging of user-flagged data
to arbitrary backends via
`FlaggingCallback`s.

So there are some new elements to the codebase:
most importantly here, a `GantryImageToTextLogger`
that inherits from `gradio.FlaggingCallback`.
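
To give a feel for the shape of such a callback, here is a stand-in that does not import `gradio` at all. It assumes the interface centers on a `setup` method, called once, and a `flag` method, called on each button click. The exact signatures differ between `gradio` versions, and the class name `ListFlagLogger`, the `sink` argument, the returned running count, and the example URL are all hypothetical.

```python
class ListFlagLogger:
    """Stand-in with the same two-method shape as a flagging callback.

    Assumption: a real callback would subclass gradio.FlaggingCallback,
    whose setup/flag signatures vary across gradio versions.
    """

    def __init__(self, sink):
        self.sink = sink  # any callable that ships a record to a backend
        self.count = 0

    def setup(self, components, flagging_dir):
        # called once, when the interface is built
        self.components = components
        self.flagging_dir = flagging_dir

    def flag(self, flag_data, flag_option=None, username=None):
        # called on every click of a flagging button
        record = {"data": list(flag_data), "flag": flag_option, "user": username}
        self.sink(record)
        self.count += 1
        return self.count  # number of flags logged so far


records = []  # stands in for a remote logging service
logger = ListFlagLogger(sink=records.append)
logger.setup(components=["image", "text"], flagging_dir="flagged")
logger.flag(["s3://example-bucket/img.png", "hello world"], flag_option="incorrect")
```

Because the backend is just a callable, the same class could forward records to a file, a queue, or a monitoring service.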

In [None]:
from app_gradio import flagging


print(flagging.GantryImageToTextLogger.__init__.__doc__)

If we add this `Callback` to our setup --
and add a Gantry API key to our environment --
then we can start sending data to Gantry's service.

In [None]:
app.make_frontend??

The short version of how the logging works:
we upload flagged images to S3 for storage (`GantryImageToTextLogger._to_s3`)
and send the URL to Gantry along with the outputs (`GantryImageToTextLogger._to_gantry`).

Below, we'll download that data
and look through it in the notebook,
using typical Python data analysis tools,
like `pandas` and `seaborn`.

By analogy to
[EDA](https://en.wikipedia.org/wiki/Exploratory_data_analysis),
consider this an "exploratory model analysis".

In [None]:
import gantry.query as gq


read_only_key = "VpPfHPDSk9e9KKAgbiHBh7mqF_8"
gq.init(api_key=read_only_key)

gdf = gq.query(  # we query Gantry's service with the following parameters:
    application="fsdl-text-recognizer",  # which tracked application should we draw from?
    tags={"env": "dev"},  # which tags (here, logging environment) should we filter to?
    # what time period should we pull data from? here, the first two months the app was up
    start_time="2022-07-01T07:00:00.000Z",
    end_time="2022-09-01T06:59:00.000Z",
)

raw_df = gdf.fetch()
df = raw_df.dropna(axis="columns", how="all")  # remove any irrelevant columns
print("number of rows:", len(df))
df = df.drop_duplicates(keep="first", subset="inputs.image")  # remove repeated reports, eg of example images
print("number of unique rows:", len(df))

print("\ncolumns:")
df.columns

We'll walk through what each of these columns means,
but the three most important are the ones we logged directly from the application:
`flag`s, `inputs.image`s, and `output_text`.

In [None]:
main_columns = [column for column in df.columns if "(" not in column]  # derived columns have a "function call" in the name
main_columns

If you're interested in playing
around with the data yourself
in Gantry's UI,
as we do in the
[video walkthrough for the lab](https://fsdl.me/2022-lab-08-video),
you'll need a Gantry account.

Gantry is currently in closed beta.
Unlike training or experiment management,
model monitoring and continual learning
are at the frontier of applied ML,
so tooling is just starting to roll out.

FSDL students are invited to this beta and
[can create a "read-only" account here](https://gantry.io/fsdl-signup)
so they can view the data in the UI
and explore it themselves.

As an early startup,
Gantry is very interested in feedback
from practitioners!
So if you do try out the Gantry UI,
send any impressions, bug reports, or ideas to
`support@gantry.io`.

This is also a chance for you
to influence the development
of a new tool that could one day
end up at the center of continual learning
workflows --
as when
[FSDL students in spring 2019 got a chance to be early users of W&B](https://www.youtube.com/watch?t=1468&v=Eiz1zcqrqw0&feature=youtu.be&ab_channel=FullStackDeepLearning).

## Basic stats and behavioral monitoring

We start by just getting some basic statistics.

For example, we can get descriptive statistics for
the information we've logged.

In [None]:
df["feedback.flag"].describe()

Note that the format we're working with is the `pandas.DataFrame` --
a standard format for tables in Python.

`pandas` can be
[very tricky](https://github.com/chiphuyen/just-pandas-things).

It's not so bad when doing exploratory analysis like this,
but take care when using it in production settings!

If you'd like to learn more `pandas`,
[Brandon Rhodes's `pandas` tutorial from PyCon 2015](https://www.youtube.com/watch?v=5JnMutdy6Fw&ab_channel=PyCon2015)
is still one of the best introductions,
even though it's nearly a decade old.

`pandas` objects support sampling with `.sample`,
which is useful for quick "spot-checking" of data.

In [None]:
df["feedback.flag"].sample(10)

Unlike in many other kinds of applications,
toxic and offensive behavior is
one of the most critical potential issues for
many ML models,
from
[generative models like GPT-3](https://www.middlebury.edu/institute/sites/www.middlebury.edu.institute/files/2020-09/gpt3-article.pdf)
to even humble
[image labeling models](https://archive.nytimes.com/bits.blogs.nytimes.com/2015/07/01/google-photos-mistakenly-labels-black-people-gorillas/).

So ML models, especially when newly deployed
or when encountering new user bases,
need careful supervision.

We use a
[Gantry tool called Projections](https://docs.gantry.io/en/stable/guides/projections.html)
to apply the NLP models from the
[`detoxify` suite](https://github.com/unitaryai/detoxify),
which score text for features like obscenity and identity attacks,
to our model's outputs.

To get a quick plot of the resulting values,
we can use the `pandas` built-in interface
to `matplotlib`:

In [None]:
df.plot(y="detoxify.obscene(outputs.output_text)", kind="hist");

Without context, this chart isn't super useful --
is a score of `obscene=0.12` bad?

We need a baseline!

Once the model is stable in production,
we can compare the values across time --
grouping or filtering production data by timestamp.
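
A minimal sketch of that kind of time-based comparison, using only the standard library. It assumes each logged record carries a timestamp in the same format as the query parameters above, along with a numeric score. The function name `weekly_means` and the choice of ISO weeks as the grouping window are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def weekly_means(records):
    """Group (timestamp, score) pairs by ISO week and average the scores."""
    buckets = defaultdict(list)
    for timestamp, score in records:
        # timestamps look like "2022-07-01T07:00:00.000Z"
        when = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
        year, week, _ = when.isocalendar()
        buckets[(year, week)].append(score)
    # sort by (year, week) so consecutive windows are easy to compare
    return {key: mean(scores) for key, scores in sorted(buckets.items())}
```

Each window's average can then be compared against the preceding windows, which serve as the baseline.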

Here, for this first version of the model,
we compare the production results with the results on the test data,
which was also ingested with `gantry`.

In [None]:
gdf = gq.query(
    application="fsdl-text-recognizer",
    tags={"env": "test"},  # picks out the "test" environment
    start_time="2022-08-12T02:15:00.000Z",
    end_time="2022-08-12T03:00:00.000Z"
)

raw_test_df = gdf.fetch()
test_df = raw_test_df.dropna(axis="columns", how="all")  # remove any irrelevant columns

test_df.sample(10)  # show a sample

To compare the two `DataFrame`s,
we `concat`enate them together
and add in some metadata
identifying where the observations came from.


In [None]:
test_df["environment"] = "test"
df["environment"] = "prod"

comparison_df = pd.concat([df, test_df])

From there, we can use grouping to calculate statistics of interest:

In [None]:
stats = comparison_df.groupby("environment").describe()

stats["detoxify.obscene(outputs.output_text)"]

These descriptive statistics are helpful,
but as with our simple plot above,
we want to _look_ at the data.

Exploratory data analysis is typically very visual --
the goal is to find phenomena so obvious
that statistical testing is an afterthought --
and so is exploratory model analysis.

`matplotlib` is based on plotting arrays,
rather than `DataFrame`s or other tabular data,
so it's not a great fit on its own here,
unless we want to tolerate a lot of boilerplate.

`pandas` has basic built-in plotting
that interfaces with `matplotlib`,
but it's not that ergonomic for comparisons or flexible
without just dropping back to matplotlib.

There are a number of other Python plotting libraries,
many with an emphasis on share-ability and interaction
([Vega-Altair](https://altair-viz.github.io/),
[`bokeh`](http://bokeh.org/),
and
[Plotly](https://plotly.com/),
to name a few)
and others with an emphasis on usability
(e.g. [`ggplot`](https://realpython.com/ggplot-python/)).

The one that we like for in-notebook analysis
that balances ease of use
on tabular data with flexibility is
[`seaborn`](https://seaborn.pydata.org/).

Comparing the distributions of the `detoxify.obscene` metric
is a single function call:

In [None]:
import seaborn as sns


sns.displot(   # plot the dis-tribution
    data=comparison_df,  # of data from this df
    # specifically, this column, along the x-axis
    x="detoxify.obscene(outputs.output_text)",
    # and split it up (in color/hue) by this column
    hue="environment"
);

We can quickly see that the obscenity scores according to `detoxify`
are generally lower in our `prod`uction environment,
so we don't have a reason to suspect
our model is behaving too badly in production
-- though see the exercises for more on this!

We can see the same thing
without having to write query, cleaning, and plotting code
[in the Gantry UI here](https://app.gantry.io/applications/fsdl-text-recognizer/distribution?view=2022-class&compare=test-ingest) --
note that viewing the dashboard requires a Gantry account,
which you can sign up for
[here](https://gantry.io/fsdl-signup).

## Debugging the Text Recognizer

In our application,
we don't have user corrections or labels from annotators,
so we can't calculate an accuracy, a loss, or a character error rate.

We instead look for signals that are correlated with
those values.

This approach has limits
(see, e.g. the analysis in the
[MLDeMon paper](https://arxiv.org/abs/2104.13621))
and setting alerts or test failures on things that are only correlated with,
rather than directly caused by, poor performance is a bad idea.

But it's very useful to have this information logged
to catch large errors at a glance
or to provide tools for slicing, filtering, and grouping data
while doing exploratory model analysis or debugging.

We can also compute these signals with Gantry Projections.

Low entropy (e.g. repetition) is a failure mode of language models,
as is excessively high entropy (e.g. uniformly random text).
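
To make "entropy" concrete, here is a minimal sketch of Shannon entropy over the characters of a string. This is an assumption about the general idea only: the exact definition behind Gantry's `text_stats.basics.entropy` projection is not given here and may differ, for example in its units or in what it counts.

```python
import math
from collections import Counter


def char_entropy(text):
    """Shannon entropy, in bits, of the character distribution of text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # sum of -p * log2(p) over the observed character frequencies
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Under this definition, a fully repetitive string like `"aaaa"` scores 0 bits, while `"abcd"`, with four equally likely characters, scores 2 bits.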

We can review the output text entropy distributions in
production and during testing
by plotting them against one another
(here or
[in the Gantry UI](https://app.gantry.io/applications/fsdl-text-recognizer/distribution?view=2022-class&compare=test-ingest)).

In [None]:
sns.displot(
    data=comparison_df,
    x="text_stats.basics.entropy(outputs.output_text)",
    hue="environment"
);

It appears there are more low-entropy strings in the model's outputs in production.

With models that operate on human-relevant data,
like text and images,
it's important to look at the raw data,
not just projections.

Let's take a look at a sample of outputs from the model running on test data:

In [None]:
test_df["outputs.output_text"].sample(10)

The results are not incredible, but they are recognizably "English with typos".

Let's look specifically at low entropy examples from production
(we can also view this
[filtered data in the Gantry UI](https://app.gantry.io/applications/fsdl-text-recognizer/data?view=2022-class-low-entropy&compare=test-ingest)).

In [None]:
low_entropy = df["text_stats.basics.entropy(outputs.output_text)"] < 5  # boolean mask over rows
df.loc[low_entropy, "outputs.output_text"].sample(10)

Yikes! Lots of repetitive gibberish.

Knowing the outputs are bad,
there are two culprits:
the input-output mapping (aka the model)
or the inputs.

We ran the same model in a similar environment
to get those outputs,
so it's most likely due to some difference in the inputs.

Let's check them!

We added Gantry Projections to look at the distribution of pixel values as well.

In [None]:
sns.displot(
    data=comparison_df,
    x="image.greyscale_image_mean(inputs.image)",
    hue="environment"
);

There's a huge difference in mean pixel values --
almost all images have mean intensities that are very dark in the testing environment,
but we see both dark and light images in production.
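
For intuition about what this projection measures, here is a minimal sketch that computes the mean intensity of a greyscale image represented as nested lists of 0-255 values. The helper names and the midpoint threshold are hypothetical, and Gantry's `image.greyscale_image_mean` may be computed differently.

```python
def greyscale_mean(pixels):
    """Mean intensity of a greyscale image given as rows of 0-255 values."""
    values = [value for row in pixels for value in row]
    return sum(values) / len(values)


def looks_light(pixels, threshold=127.5):
    """True when the image is mostly bright, e.g. dark ink on white paper."""
    return greyscale_mean(pixels) > threshold
```

An image of light text on a dark background has a low mean, because most pixels are background, while a photo of ink on paper has a high one.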

Reviewing the
[raw data in Gantry](https://app.gantry.io/applications/fsdl-text-recognizer/data?view=2022-class-low-entropy&compare=test-ingest)
confirms that we are getting images with very different brightnesses in production
and whiffing the predictions
-- along with images that reveal a number of other interesting failure modes.

To take a look locally,
we'll need to pull the images down from S3,
where they are stored.

The cell below defines a quick utility for
reading from S3 without authentication.

It is based on the `smart_open` and `boto3` libraries,
which we briefly saw in the
[model deployment lab](https://fsdl.me/lab07-colab)
and the
[data annotation lab](https://fsdl.me/lab06-colab).

In [None]:
import boto3
from botocore import UNSIGNED
from botocore.config import Config
import smart_open

from text_recognizer.util import read_image_pil_file

# spin up a client for communicating with s3 without authenticating ("UNSIGNED" activity)
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
unsigned_params = {"client": s3}

def read_image_unsigned(image_uri, grayscale=False):
    with smart_open.open(image_uri, "rb", transport_params=unsigned_params) as image_file:
        return read_image_pil_file(image_file, grayscale)

Run the cell below to repeatedly sample a random input/output pair
flagged in production.

In [None]:
row = df.sample().iloc[0]
print("image url:", row["inputs.image"])
print("prediction:", row["outputs.output_text"])
read_image_unsigned(row["inputs.image"])

### Take-aways for developing models

The most immediate take-away from reviewing just a few examples is that
user data is way more heterogeneous than train/val/test data!

This is a
[fairly](https://browsee.io/blog/a-guide-to-session-replays-for-product-managers/)
[universal](https://medium.com/@beasles/edge-case-responsive-design-9b610138ddbd)
[finding](https://quoteinvestigator.com/2021/05/04/no-plan/).

Let's also consider some specific failure modes in our case
and how we might resolve them:

- Failure mode: Users mostly provide images with dark text on a light background, but we train on dark backgrounds.
  - Resolution: We could check image brightness and flip if needed,
  but this feels like a cop-out -- most text is dark on a light background! 
  - Resolution: We add image brightness inversion to our train-time augmentations.
- Failure mode: Users expect our "handwritten text recognition" tool to work with printed and digital text.
  - Resolution: We could try better sign-posting and user education,
  but this is also something of a cop-out.
  Users expect the tool to work on all text,
  so we shouldn't violate that expectation.
  - Resolution: We synthesize digital text data --
  text rendering is a feature of just about any mature programming language.
- Failure mode: Users provide text on heterogeneous backgrounds.
  - Resolution: We collect or synthesize more heterogeneous data,
  e.g. placing text (with or without background coloring)
  on top of random image backgrounds.
- Failure mode: Users provide text with characters and symbols outside of our dictionary.
  - Resolution: We can expand the model outputs and collect more heterogeneous data.
- Failure mode: Users provide images with multiple blocks of text.
  - Resolution: We develop an architecture/task definition that can handle multiple regions.
  We'll need to collect and/or synthesize data to support it.
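
As one example, the brightness-inversion augmentation from the first failure mode can be sketched in a few lines. This version works on nested lists of 0-255 greyscale values so that it stays self-contained. The function names and the default probability are hypothetical, and a real training pipeline would apply the same idea to image tensors.

```python
import random


def invert(pixels):
    """Flip brightness: 0 becomes 255 and vice versa."""
    return [[255 - value for value in row] for row in pixels]


def random_invert(pixels, p=0.5, rng=random):
    """Train-time augmentation: invert the image with probability p."""
    return invert(pixels) if rng.random() < p else pixels
```

Applied at train time, this shows the model both polarities of the same text, so it does not need a brightness check at inference.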

Notice: these are almost entirely changes to data,
and most of them involve collecting more or synthesizing it.

This is very much typical!

Data drives improvements to models,
[even at scale](https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications).

### Take-aways for exploratory model analysis

Notice that we had to write a lot of code,
which we developed and ran in a
tight interactive loop.

This type of code is very hard to turn into scripts --
how do you trigger an alert on a plot? --
which makes it brittle and hard to version and share.

It's also based on possibly very large-scale data artifacts.

The right tool for this job is a UI
on top of a database.

In the
[video walkthrough for this lab](https://fsdl.me/2022-lab-08-video),
we do effectively the same analysis,
but inside Gantry,
which makes the process more fluid.

Gantry is still in closed beta,
but if you're interested in applying it to your own applications, you can
[join the waitlist](https://gantry.io/waitlist/).

# Exercises

### 🌟 Examine the test data strings, both output and ground truth.

We compared our production obscenity metric to the test-time values of that same metric
and determined that we had not gotten worse,
so things were fine.

But what if the test-time baseline is bad?

Review the raw test ground truth data
[here](https://app.gantry.io/applications/fsdl-text-recognizer/data?view=test-ingest),
if you
[signed up for a Gantry account](https://gantry.io/fsdl-signup),
or by looking at the contents of `test_df` above.

Sort by `detoxify.identity_attack(feedback.ground_truth_string)`
or filter to only high values of that metric.

Review the example `feedback.ground_truth_string` texts and consider:
is this the subset of English
we want the model to be training on?
What objections might be raised to the contents?

You might also look for cases where the `detoxify` models misunderstood meaning --
e.g. an innocuous use of a word that's often used objectionably.

### 🌟🌟 Start building "regression testing suites" by doing error analysis on these examples.

Do this by going through feedback data one image/text pair at a time --
[in Gantry](https://app.gantry.io/applications/fsdl-text-recognizer/data?view=2022-class-low-entrop)
or inside this notebook.

Start by just taking notes on each example
(anywhere -- Google Sheets/Excel/Notion, or just a sheet of paper).

The primary question you should ask is:
how does this example differ from the data shown in training?

Check
[this W&B Artifact page](https://wandb.ai/cfrye59/fsdl-text-recognizer-2021-training/artifacts/run_table/run-1vrnrd8p-trainpredictions/v194/files/train/predictions.table.json#f5854c9c18f6c24a4e99)
to see what training data
(including augmentation)
looks like.

Once you have some notes,
try and formalize them into a small number of "failure modes" --
you can choose to align them with the failure modes described in the section
on take-aways for model development or not.

If you want to finish the loop,
you might set up Label Studio, as in
[the data annotation lab](https://fsdl.me/lab06-colab).
An annotator should add at least a
"label" that gives the type of issue
and perhaps also add a text annotation
while they are at it.