
Install Dependencies

In [1]:
!pip install transformers gymnasium stable-baselines3 sentencepiece



Import Libraries

In [3]:
import random
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from transformers import T5Tokenizer, T5ForConditionalGeneration
from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env


Sample Email Dataset (Synthetic)

In [4]:
sample_emails = [
    {"subject": "Meeting Request", "body": "Can we schedule a meeting tomorrow at 10 AM?"},
    {"subject": "Leave Application", "body": "I'd like to apply for leave next Monday and Tuesday."},
    {"subject": "Project Status", "body": "What is the current progress on the AI model development?"},
    {"subject": "Thanks", "body": "Thanks for your prompt reply to my earlier email."},
]

def get_random_email():
    return random.choice(sample_emails)
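`get_random_email()` draws from Python's global `random` state, so the sequence of emails changes from run to run. A minimal sketch of a seedable variant is shown here; `make_email_sampler` and the shortened `emails` list are hypothetical names used only for illustration, not part of the original notebook.

```python
import random

# Hypothetical stand-in for the notebook's sample_emails list.
emails = [
    {"subject": "Meeting Request", "body": "Can we schedule a meeting tomorrow at 10 AM?"},
    {"subject": "Thanks", "body": "Thanks for your prompt reply to my earlier email."},
]

def make_email_sampler(emails, seed=None):
    """Return a zero-argument function that samples emails from its own RNG."""
    rng = random.Random(seed)  # private generator; the global random state is untouched

    def sample():
        return rng.choice(emails)

    return sample

sampler_a = make_email_sampler(emails, seed=0)
sampler_b = make_email_sampler(emails, seed=0)
draws_a = [sampler_a()["subject"] for _ in range(5)]
draws_b = [sampler_b()["subject"] for _ in range(5)]
print(draws_a == draws_b)  # two samplers with the same seed yield the same sequence
```

Keeping the generator private means other code that calls `random` does not disturb the email sequence.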


Load T5 Model for Reply Generation

In [5]:
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

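def generate_reply(email_text):
    """Generate a reply with T5 for the given email text (at most 50 tokens)."""
    # Note: "reply: " is not one of the task prefixes t5-small was trained on,
    # so the output is often off-task (for example, a German translation).
    input_text = "reply: " + email_text
    input_ids = tokenizer.encode(input_text, return_tensors='pt', truncation=True)
    output_ids = model.generate(input_ids, max_length=50)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)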



Define Custom Gym Environment

In [6]:
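class EmailReplyEnv(gym.Env):
    """One-step environment: the agent picks how much of the email to feed to T5."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(3)  # 0: Short, 1: Medium, 2: Long
        # Placeholder observation: always [0], so the agent cannot see the email.
        self.observation_space = spaces.Box(low=0, high=255, shape=(1,), dtype=np.uint8)
        self.current_email = None

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # get_random_email() uses the global random module, so `seed` does not affect it.
        self.current_email = get_random_email()
        return np.array([0], dtype=np.uint8), {}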

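    def step(self, action):
        action = int(action)  # predict() returns a NumPy value; use a plain int
        email_text = self.current_email["subject"] + ": " + self.current_email["body"]

        # The action controls how many characters of the email reach the model.
        if action == 0:
            reply = generate_reply(email_text[:20])  # first 20 characters
        elif action == 1:
            reply = generate_reply(email_text[:50])  # first 50 characters
        else:
            reply = generate_reply(email_text)       # full text

        # Simulated reward: fixed per action, independent of the generated reply
        reward = float([0.3, 1.0, 0.8][action])

        terminated = True   # every episode is a single step
        truncated = False
        info = {"reply": reply}

        return np.array([0], dtype=np.uint8), reward, terminated, truncated, info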
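In `step()`, the action truncates the text passed to the model rather than limiting the length of the reply. The sketch below mirrors that slicing on the "Leave Application" sample so the effect is visible; `prompt_for_action` is a hypothetical helper, and it leaves out the T5 call entirely.

```python
# Same string that step() builds for the "Leave Application" sample email.
email_text = "Leave Application: I'd like to apply for leave next Monday and Tuesday."

def prompt_for_action(text, action):
    """Mirror the slicing in step(): the action limits the input, not the reply."""
    limits = {0: 20, 1: 50}      # characters kept for "short" and "medium"
    limit = limits.get(action)   # None for action 2 means no truncation
    return text if limit is None else text[:limit]

for action in range(3):
    print(action, repr(prompt_for_action(email_text, action)))
```

With action 0 the model only sees `'Leave Application: I'`, which may help explain why some generated replies are fragments.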


Check Environment Validity

In [7]:
env = EmailReplyEnv()
check_env(env)
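`check_env` confirms that observations fit the declared space, but it does not check that they carry information. The environment above always returns `[0]`, so the policy cannot tell one email from another. A minimal sketch of one possible feature, assuming a clipped word count is an acceptable stand-in for a real text encoding; `encode_email` is a hypothetical helper.

```python
def encode_email(email, max_value=255):
    """Map an email to one integer feature: its word count, clipped to max_value."""
    text = email["subject"] + ": " + email["body"]
    return min(len(text.split()), max_value)

email = {"subject": "Thanks", "body": "Thanks for your prompt reply to my earlier email."}
feature = encode_email(email)
print(feature)  # 10
```

Inside the environment, the value could be returned as `np.array([encode_email(self.current_email)], dtype=np.uint8)`, which stays within the existing `Box(low=0, high=255)` space.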


Train PPO Agent

In [8]:
# Every environment step runs T5 generation, so training is slow on CPU.
model_rl = PPO("MlpPolicy", env, verbose=1)
model_rl.learn(total_timesteps=1000)


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1        |
|    ep_rew_mean     | 0.683    |
| time/              |          |
|    fps             | 2        |
|    iterations      | 1        |
|    time_elapsed    | 695      |
|    total_timesteps | 2048     |
---------------------------------


<stable_baselines3.ppo.ppo.PPO at 0x7b5b25508cd0>
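The log reports 2048 timesteps although `learn` was called with `total_timesteps=1000`. PPO in Stable-Baselines3 collects full rollouts of `n_steps` per environment (2048 by default), so the requested budget is effectively rounded up to a whole number of rollouts. The arithmetic can be sketched as follows; `effective_timesteps` is a hypothetical helper, not a library function.

```python
import math

def effective_timesteps(total_timesteps, n_steps=2048, n_envs=1):
    """Round the requested budget up to a whole number of PPO rollouts."""
    rollout = n_steps * n_envs  # steps collected before each policy update
    return math.ceil(total_timesteps / rollout) * rollout

steps = effective_timesteps(1000)
print(steps)             # 2048
print(int(steps / 695))  # 2, matching the logged fps for 695 seconds elapsed
```

Passing a smaller `n_steps` to `PPO(...)` would shorten each rollout and therefore the minimum training time.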

Test & Display Replies

In [10]:
for i in range(5):
    obs, _ = env.reset()
    # predict() samples from the policy by default; pass deterministic=True for the greedy action
    action, _states = model_rl.predict(obs)
    obs, reward, terminated, truncated, info = env.step(action)

    print(f" Email {i+1}")
    print("Subject:", env.current_email["subject"])
    print("Body:", env.current_email["body"])
    print(" AI Reply:", info["reply"])
    print(" Reward:", reward)
    print("====================================\n")


 Email 1
Subject: Leave Application
Body: I'd like to apply for leave next Monday and Tuesday.
 AI Reply: Antwort: Leave Application: Leave Application: Leave Application: Leave Application: I'd like to apply for leave next Monday and Tuesday.
 Reward: 0.8

 Email 2
Subject: Thanks
Body: Thanks for your prompt reply to my earlier email.
 AI Reply: Vielen Dank für Ihre prompte, wortwortwortete ich Ihnen an dieser Stelle.
 Reward: 1.0

 Email 3
Subject: Leave Application
Body: I'd like to apply for leave next Monday and Tuesday.
 AI Reply: True
 Reward: 1.0

 Email 4
Subject: Leave Application
Body: I'd like to apply for leave next Monday and Tuesday.
 AI Reply: True
 Reward: 1.0

 Email 5
Subject: Thanks
Body: Thanks for your prompt reply to my earlier email.
 AI Reply: reply:
 Reward: 0.3
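Because the reward depends only on the action, the rewards printed above reveal which action was taken in each episode. The sketch below tallies the five results and compares their mean with the mean reward of choosing actions uniformly at random; the `observed` list is copied by hand from the output above.

```python
from collections import Counter

rewards_by_action = [0.3, 1.0, 0.8]   # reward table from step()
observed = [0.8, 1.0, 1.0, 1.0, 0.3]  # rewards from the five test episodes

# Rewards are unique per action, so each reward maps back to one action.
actions = [rewards_by_action.index(r) for r in observed]
counts = Counter(actions)

mean_observed = sum(observed) / len(observed)
uniform_baseline = sum(rewards_by_action) / len(rewards_by_action)

print(dict(counts))
print(round(mean_observed, 2), round(uniform_baseline, 2))
```

The logged `ep_rew_mean` of 0.683 is close to the uniform baseline of 0.7, and five episodes are too few to tell whether the policy has moved away from random choice.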

