# Iterating on the system 🔄

With the most important components in place for evaluating, experimenting with, and monitoring the system, we can now iterate on the system to improve it.

This part is free-form: you choose what to iterate on. Here are some suggestions:

- **Improve latency**
  - Look into asynchronous processing
  - Combine multiple requests
  - Try different models for different tasks
- **Improve evaluation metrics**
  - Try different models and parameters
  - Add metrics and extend evaluation
  - Use [advanced prompting techniques](https://python.useinstructor.com/prompting/)
- **Address downvoted outputs**
  - Label downvoted article outputs, and add to evaluation data
  - Improve model to address downvoted outputs
- **Improve monitoring results**
  - Reduce number of tokens, increase success rates, ...
  - Extend dashboard with useful widgets
- **Other**
  - Extend unit tests
  - Ensure extracted quotes are actually in the article (see [inspiration](https://python.useinstructor.com/examples/exact_citations/))
  - Run the evaluations in an Azure ML job so we can run them on a schedule
  - ... And many more!
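
To make one of the suggestions above concrete, here is a minimal sketch of the quote check. It assumes the extracted quotes are available as plain strings and that matching should ignore case and whitespace differences. `find_missing_quotes` is a hypothetical helper, not part of the training package.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase the text and collapse every whitespace run to a single space."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_missing_quotes(article: str, quotes: list[str]) -> list[str]:
    """Return the quotes that do not occur in the article after normalization.

    Hypothetical helper for illustration: an empty result means every quote
    was found in the article text.
    """
    normalized_article = _normalize(article)
    return [quote for quote in quotes if _normalize(quote) not in normalized_article]
```

The number of missing quotes per output could then be added as an evaluation metric, or logged so that it shows up in monitoring.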

Most importantly, try to make your progress visible and measurable.
We have experiments, logs, traces, and a monitoring dashboard at our disposal,
so try to iterate in a way that shows up in these tools.

If you wish to improve something that is not yet measurable or visualized in our monitoring setup, try to extend the monitoring setup first to include it! For example, we are currently not logging which model we are using, but that could be a useful addition to the logs if we are experimenting with different models.
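
As a rough sketch of the model-logging example, the snippet below adds a `model` field to a log payload. The `event`, `article`, and `output` keys mirror the fields read from `jsonPayload` in the feedback analysis in this notebook. The `model` key, the logger name, and both helper functions are assumptions made for illustration, and how the payload reaches your log backend is left out.

```python
import json
import logging
from collections import Counter

logger = logging.getLogger("news_reader_example")  # hypothetical logger name


def build_log_payload(event: str, article: str, output: dict, model: str) -> dict:
    """Build a structured payload that also records the model name (assumed field)."""
    return {"event": event, "article": article, "output": output, "model": model}


def log_event(event: str, article: str, output: dict, model: str) -> dict:
    """Log the payload as a JSON string and return it."""
    payload = build_log_payload(event, article, output, model)
    logger.info(json.dumps(payload))
    return payload


def count_by_model(entries: list[dict]) -> Counter:
    """Count log entries per model; entries without the field count as 'unknown'."""
    return Counter(entry["jsonPayload"].get("model", "unknown") for entry in entries)
```

With such a field in place, `count_by_model(downvoted)` and `count_by_model(upvoted)` would show whether votes differ between models.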

So at each step: check logs/traces/metrics, improve, re-deploy, collect monitoring data, repeat.

Good luck! 🚀

And for more inspiration, here are some deep-dive examples of our feedback data! 👇

In [None]:
import matplotlib.pyplot as plt
import numpy as np

from llmops_training.news_reader.logs import load_entries_with_feedback

In [None]:
downvoted = load_entries_with_feedback("downvote", from_hours_ago=1.5)
upvoted = load_entries_with_feedback("upvote", from_hours_ago=1.5)

In [None]:
fig = plt.figure(figsize=(5, 2.5))

downvotes = [len(downvoted), 0]
upvotes = [0, len(upvoted)]

plt.bar(["downvotes", "upvotes"], downvotes, color="red", alpha=0.7, label="downvotes")
plt.bar(["downvotes", "upvotes"], upvotes, color="green", alpha=0.7, label="upvotes")

plt.legend()

plt.ylabel("# votes")

fig.tight_layout()
plt.show()

In [None]:
downvoted_events = [entry["jsonPayload"]["event"] for entry in downvoted]
upvoted_events = [entry["jsonPayload"]["event"] for entry in upvoted]

# Sort the unique event names so the bar order is the same on every run
all_events = sorted(set(downvoted_events + upvoted_events))

downvote_counts = [downvoted_events.count(event) for event in all_events]
upvote_counts = [upvoted_events.count(event) for event in all_events]

x = np.arange(len(all_events))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 2.5))

rects1 = ax.bar(x - width/2, downvote_counts, width, label='Downvotes', color='red', alpha=0.7)
rects2 = ax.bar(x + width/2, upvote_counts, width, label='Upvotes', color='green', alpha=0.7)

ax.set_ylabel('# votes')
ax.set_xticks(x)
ax.set_xticklabels(all_events)
ax.legend()

fig.tight_layout()

plt.show()

In [None]:
downvoted_lengths = [len(entry["jsonPayload"]["article"]) for entry in downvoted]
upvoted_lengths = [len(entry["jsonPayload"]["article"]) for entry in upvoted]

downvoted_mean = np.mean(downvoted_lengths)
downvoted_std = np.std(downvoted_lengths)
upvoted_mean = np.mean(upvoted_lengths)
upvoted_std = np.std(upvoted_lengths)

categories = ['Downvoted', 'Upvoted']
means = [downvoted_mean, upvoted_mean]
stds = [downvoted_std, upvoted_std]

fig, ax = plt.subplots(figsize=(5, 2.5))

for i, (mean, std, c) in enumerate(zip(means, stds, [(0.7,0,0,0.7), (0,0.7,0,0.7)]), 1):
    ax.errorbar(mean, i, xerr=std, fmt='o', capsize=5, color=c, markersize=10)

ax.set_xlabel('Average article length (characters)')
ax.set_title('Average article length with standard deviation')
ax.set_yticks(range(0, 4))
ax.set_yticklabels([""] + categories + [""])

fig.tight_layout()

plt.show()

In [None]:
_downvoted = [entry for entry in downvoted if "title" in entry["jsonPayload"]["output"]]
_upvoted = [entry for entry in upvoted if "title" in entry["jsonPayload"]["output"]]

downvoted_lengths = [len(entry["jsonPayload"]["output"]["title"]) for entry in _downvoted]
upvoted_lengths = [len(entry["jsonPayload"]["output"]["title"]) for entry in _upvoted]

# Calculate mean and standard deviation
downvoted_mean = np.mean(downvoted_lengths)
downvoted_std = np.std(downvoted_lengths)
upvoted_mean = np.mean(upvoted_lengths)
upvoted_std = np.std(upvoted_lengths)

categories = ['Downvoted', 'Upvoted']
means = [downvoted_mean, upvoted_mean]
stds = [downvoted_std, upvoted_std]

fig, ax = plt.subplots(figsize=(5, 2.5))

for i, (mean, std, c) in enumerate(zip(means, stds, [(0.7,0,0,0.7), (0,0.7,0,0.7)]), 1):
    ax.errorbar(mean, i, xerr=std, fmt='o', capsize=5, color=c, markersize=10)

ax.set_xlabel('Average output title length (characters)')
ax.set_title('Average output title length with standard deviation')
ax.set_yticks(range(0, 4))
ax.set_yticklabels([""] + categories + [""])

fig.tight_layout()

plt.show()