# Interactive Chat Bot

## Objective
Build and fine-tune a system that uses pre-trained transformer models to address real-world tasks. The system helps a student seeking an internship by performing several tasks that involve natural language understanding and generation.

## Specifications
1. Interactive Command Line Interface
    * Create a text-based program that interacts with users through open-ended plain-text prompts.
2. Sentence Auto-Completion
    * Implement functionality that predicts and completes common sentences based on context. Use your transformer model to generate plausible continuations for user-provided text.
3. Cover Letter Analysis
    * Design a feature where the system reads a provided cover letter and extracts meaningful information about the applicant.
4. Dynamic Cover Letter Generation
    * Enable the system to generate a personalized cover letter based on user-specified constraints, such as desired role, key skills, or industry. Combine a flexible template with dynamic content generation to meet specific user needs.
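
The command-line interface in item 1 can be sketched as a loop that classifies each prompt into an intent and dispatches to a handler. This is a minimal sketch, not the notebook's implementation: the handler functions, `keyword_intent`, and `run_session` are hypothetical names, the keyword classifier is only a stand-in for the fine-tuned BERT classifier trained later, and the `"Read"` label is an assumption (only `"Exit"` and `"Write"` appear in the recorded outputs).

```python
from typing import Callable, Dict, Iterable, List


# Hypothetical handlers: placeholders for the real features.
def handle_write(prompt: str) -> str:
    return "[write] would generate a cover letter here"


def handle_read(prompt: str) -> str:
    return "[read] would analyze a cover letter here"


HANDLERS: Dict[str, Callable[[str], str]] = {
    "Write": handle_write,
    "Read": handle_read,  # label name assumed
}


def keyword_intent(prompt: str) -> str:
    """Placeholder classifier (assumption); a trained model would go here."""
    lowered = prompt.lower()
    if any(word in lowered for word in ("exit", "quit", "stop")):
        return "Exit"
    if "read" in lowered or "analy" in lowered:
        return "Read"
    return "Write"


def run_session(prompts: Iterable[str],
                classify: Callable[[str], str] = keyword_intent) -> List[str]:
    """Process prompts in order and stop at the first 'Exit' intent."""
    replies: List[str] = []
    for prompt in prompts:
        intent = classify(prompt)
        if intent == "Exit":
            replies.append("Goodbye.")
            break
        replies.append(HANDLERS[intent](prompt))
    return replies
```

For an interactive session, the prompts could come from something like `iter(lambda: input("> "), "")`, which stops on an empty line; passing a different `classify` callable swaps in the trained model without touching the loop.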

In [9]:
import pandas as pd
import torch

menu_df = pd.read_csv("sample_cover_letters.csv")

menu_df.tail()

| Unnamed: 0 | Cover Letter | Name | Skills | Objective |
|---|---|---|---|---|
| 95 | Dear Hiring Manager,\r\n\r\nMy name is Daniel ... | Daniel Brown | SQL, Cloud Computing, Data Analysis, Java | To advance in the field of artificial intellig... |
| 96 | Dear Hiring Manager,\r\n\r\nMy name is Isla Wi... | Isla Wilson | Java, Project Management, Data Analysis | Looking for a role in financial modeling and d... |
| 97 | Dear Hiring Manager,\r\n\r\nMy name is Bob Smi... | Bob Smith | Leadership, SQL, Data Analysis, Machine Learning | Seeking a creative design role to innovate use... |
| 98 | Dear Hiring Manager,\r\n\r\nMy name is Isla Wi... | Isla Wilson | Cloud Computing, Deep Learning, Machine Learni... | Looking for a role in financial modeling and d... |
| 99 | Dear Hiring Manager,\r\n\r\nMy name is Daniel ... | Daniel Brown | Data Analysis, Cloud Computing, Machine Learni... | Seeking a software engineering position to app... |


In [10]:
#!pip install datasets
from datasets import load_dataset

dataset = load_dataset("csv", data_files={"train": "sample_cover_letters.csv"}, split="train")

Generating train split: 0 examples [00:00, ? examples/s]

In [11]:
dataset

Dataset({
    features: ['Cover Letter', 'Name', 'Skills', 'Objective'],
    num_rows: 100
})

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["Cover Letter"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

In [14]:
# from sklearn.preprocessing import MultiLabelBinarizer
# import torch

# # Extract relevant columns from the dataset
# names = dataset["Name"]
# skills = dataset["Skills"]
# objectives = dataset["Objective"]

# # Combine labels into a structured format for MultiLabelBinarizer
# # Here, each entry is a tuple (Name, Skills, Objective)
# structured_labels = [{"Name": name, "Skills": skills_list.split(','), "Objective": obj}
#                      for name, skills_list, obj in zip(names, skills, objectives)]

# # Prepare the MultiLabelBinarizer for the structured task
# mlb = MultiLabelBinarizer()

# # Extract multi-label outputs (flatten the dictionary values)
# flattened_labels = [
#     [name] + skills + [objective]
#     for entry in structured_labels
#     for name, skills, objective in zip(
#         [entry['Name']],
#         entry['Skills'],
#         [entry['Objective']]
#     )
# ]

# # Fit and transform the flattened labels
# label_encoded = mlb.fit_transform(flattened_labels)

# # Map the encoded labels back to the dataset
# def add_labels(example, idx):
#     example['labels'] = torch.tensor(label_encoded[idx], dtype=torch.float)
#     return example

# tokenized_datasets = tokenized_datasets.map(add_labels, with_indices=True)
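from sklearn.preprocessing import LabelBinarizer

# NOTE: this cell needs a DataFrame with a 'Label' column (the intent-prompt data
# whose 'Text' and 'Label' features appear in later cells). The cover-letter CSV
# loaded into menu_df above has no such column, so the call below raises KeyError.
enc = LabelBinarizer()
enc.fit(menu_df['Label'])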

KeyError: 'Label'

In [None]:
enc.transform(menu_df['Label'])[:10]

array([[0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1]])
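
The one-hot rows above, and the later use of `enc.inverse_transform` on raw logits, can be illustrated without scikit-learn. The sketch below is a plain-Python stand-in, not the library's code: it assumes the classes are stored in sorted order and that decoding a row of real-valued scores picks the column with the largest value. The helper names are hypothetical, and the `"Read"` label is an assumption (the recorded outputs only show `"Exit"` and `"Write"`).

```python
from typing import List, Sequence


def fit_classes(labels: Sequence[str]) -> List[str]:
    """Collect the distinct labels in sorted order (one column per class)."""
    return sorted(set(labels))


def one_hot(label: str, classes: Sequence[str]) -> List[int]:
    """Encode a single label as a 0/1 indicator row."""
    return [1 if cls == label else 0 for cls in classes]


def decode(scores: Sequence[float], classes: Sequence[str]) -> str:
    """Map a row of scores (0/1 indicators or logits) back to the top class."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return classes[best]


# "Read" is an assumed label name for the third intent.
classes = fit_classes(["Write", "Exit", "Read", "Write"])
encoded = one_hot("Write", classes)
decoded = decode([4.2353, -3.5931, -3.5777], classes)
```

Because decoding only looks for the largest score, passing logits straight to the decoder gives the same class as `torch.argmax` over the label dimension.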

In [None]:
import torch
def add_labels(example):
    example['labels'] = torch.tensor(enc.transform([example['Label']])[0]*1.0)
    return example

tokenized_datasets = tokenized_datasets.map(add_labels)

Map:   0%|          | 0/303 [00:00<?, ? examples/s]

In [None]:
tokenized_datasets

Dataset({
    features: ['Text', 'Label', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 303
})

In [None]:
# Drop the raw text and string label; the model only needs the tensor columns
tokenized_datasets = tokenized_datasets.remove_columns(["Text", "Label"])
tokenized_datasets.set_format("torch")

In [None]:
tokenized_datasets

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 303
})

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the pre-trained tokenizer and a classification head with three intent labels
model_ckpt = "bert-base-uncased"
num_labels = 3  # one label per intent:
# 1. Example: Help me write a cover letter
# 2. Example: Help me read a cover letter
# 3. Example: Exit

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSequenceClassification.from_pretrained(
    model_ckpt,
    num_labels=num_labels,
    problem_type="multi_label_classification",  # BCE-with-logits loss; expects float label vectors
)

# Tokenize input text
text = "This is a great example."
inputs = tokenizer(text, return_tensors="pt")

# Get model output
outputs = model(**inputs)

# Process output
logits = outputs.logits
predictions = torch.argmax(logits, dim=1)

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
logits.shape

torch.Size([1, 3])
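
With `problem_type="multi_label_classification"`, the loss that Transformers computes when `labels` are passed is binary cross-entropy on the logits (`BCEWithLogitsLoss`), which is why the labels are stored as float vectors. The following is a plain-Python sketch of that formula for a single example, written for illustration only; `bce_with_logits` is a hypothetical helper, not a library function, and the logit values are made up.

```python
import math
from typing import Sequence


def bce_with_logits(logits: Sequence[float], targets: Sequence[float]) -> float:
    """Mean binary cross-entropy over the labels of one example."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Numerically stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


target = [0.0, 1.0, 0.0]                                   # one-hot label for the middle class
uninformed = bce_with_logits([0.0, 0.0, 0.0], target)      # all logits zero: loss is log(2)
confident = bce_with_logits([-3.0, 4.0, -3.0], target)     # correct and confident: small loss
mistaken = bce_with_logits([4.0, -3.0, -3.0], target)      # confident but wrong: large loss
```

Each of the three outputs is scored as an independent yes/no decision, so the one-hot training labels push the matching logit up and the other two down.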

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [None]:
from torch.utils.data import DataLoader

In [None]:
# Preview one batch to check the shapes the training loop will see

# Create a DataLoader
train_dataloader = DataLoader(tokenized_datasets, batch_size=5, shuffle=True)

# Grab a single batch for inspection
batch = next(iter(train_dataloader))

In [None]:
batch

{'input_ids': tensor([[ 101, 2079, 2017,  ...,    0,    0,    0],
         [ 101, 1045, 1521,  ...,    0,    0,    0],
         [ 101, 2043, 1996,  ...,    0,    0,    0],
         [ 101, 2071, 2017,  ...,    0,    0,    0],
         [ 101, 2054, 1521,  ...,    0,    0,    0]]),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 'labels': tensor([[0., 1., 0.],
         [0., 0., 1.],
         [1., 0., 0.],
         [0., 1., 0.],
         [0., 0., 1.]])}

In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

In [None]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
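
The step count and the linear schedule can be checked by hand. The sketch below assumes the 303 training examples and batch size of 5 shown in the other cells, a `DataLoader` that keeps the final partial batch (its default), and a linear decay from the base learning rate to zero with optional warmup. `steps_per_epoch` and `linear_lr` are hypothetical helpers written for illustration, not library functions.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


def linear_lr(base_lr: float, step: int, total_steps: int, warmup_steps: int = 0) -> float:
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


num_epochs = 3
total_steps = num_epochs * steps_per_epoch(303, 5)   # 3 * 61 = 183
lr_start = linear_lr(5e-5, 0, total_steps)
lr_end = linear_lr(5e-5, total_steps, total_steps)
```

The total of 183 steps matches the length of the progress bar in the training cell.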

In [None]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/183 [00:00<?, ?it/s]

In [None]:
model.eval()  # switch off dropout; the training loop left the model in train mode
text = "Stop this darn program."
inputs = tokenizer(text, return_tensors="pt").to(device)

# Get model output
outputs = model(**inputs)

# Process output
logits = outputs.logits
predictions = torch.argmax(logits, dim=1)

In [None]:
logits,predictions,enc.inverse_transform(logits.cpu().detach().numpy())

(tensor([[ 4.2353, -3.5931, -3.5777]], device='cuda:0',
        grad_fn=<AddmmBackward0>),
 tensor([0], device='cuda:0'),
 array(['Exit'], dtype='<U5'))
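
The logits above are easier to read as per-label probabilities. Since the model was configured for multi-label classification, a sigmoid on each logit is the matching interpretation; the plain-Python sketch below applies it to the logits printed above. The class order `Exit`, `Read`, `Write` is an assumption based on sorted label names, and `"Read"` itself is an assumed label (only `"Exit"` and `"Write"` appear in the outputs).

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function: maps a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


classes = ["Exit", "Read", "Write"]       # assumed order; "Read" is an assumed name
logits = [4.2353, -3.5931, -3.5777]       # values copied from the output above

probs = {cls: sigmoid(x) for cls, x in zip(classes, logits)}
best = max(probs, key=probs.get)          # same winner as argmax over the logits
```

A probability threshold on `probs[best]` would be one way to let the interface ask the user to rephrase when no intent is clearly predicted.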

In [None]:
text = "Please help me write the best cover letter to get a deep learning internship"
inputs = tokenizer(text, return_tensors="pt").to(device)

# Get model output
outputs = model(**inputs)

# Process output
logits = outputs.logits
predictions = torch.argmax(logits, dim=1)

In [None]:
logits,predictions,enc.inverse_transform(logits.cpu().detach().numpy())

In [None]:
text = "I am a recruiting manager and I need help reading a resume"
inputs = tokenizer(text, return_tensors="pt").to(device)

# Get model output
outputs = model(**inputs)

# Process output
logits = outputs.logits
predictions = torch.argmax(logits, dim=1)

In [None]:
logits,predictions,enc.inverse_transform(logits.cpu().detach().numpy())

(tensor([[-3.2274, -3.1235,  2.7861]], device='cuda:0',
        grad_fn=<AddmmBackward0>),
 tensor([2], device='cuda:0'),
 array(['Write'], dtype='<U5'))

In [None]:
import torch

# Save the model's state dictionary
torch.save(model.state_dict(), 'my_model_read_cover.pth')

# Save the tokenizer (optional but recommended)
tokenizer.save_pretrained('my_tokenizer_read_cover')

# To load the model later:
# model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=num_labels)
# model.load_state_dict(torch.load('my_model_read_cover.pth'))
# tokenizer = AutoTokenizer.from_pretrained('my_tokenizer_read_cover')

('my_tokenizer/tokenizer_config.json',
 'my_tokenizer/special_tokens_map.json',
 'my_tokenizer/vocab.txt',
 'my_tokenizer/added_tokens.json',
 'my_tokenizer/tokenizer.json')

In [34]:
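# Load the saved model and tokenizer, then classify one user input
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=num_labels)
model.load_state_dict(torch.load('my_model_read_cover.pth', map_location=torch.device('cpu')))
model.eval()  # inference mode: disables dropout
tokenizer = AutoTokenizer.from_pretrained('my_tokenizer_read_cover')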

text = input()
inputs = tokenizer(text, return_tensors="pt")

# Get model output
outputs = model(**inputs)

# Process output
logits = outputs.logits
predictions = torch.argmax(logits, dim=1)
logits,predictions,enc.inverse_transform(logits.cpu().detach().numpy())

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  model.load_state_dict(torch.load('my_model_prompting.pth', map_location=torch.device('cpu')))


 help me with the interpretation of my cover letter


(tensor([[-3.6832, -3.1143,  3.5730]], grad_fn=<AddmmBackward0>),
 tensor([2]),
 array(['Write'], dtype='<U5'))

In [1]:
commit_message = input("Enter commit message: ")
!git add *.ipynb
!git commit -m "{commit_message}"
!git push


Enter commit message:  updated user_prompting


[main d62e8ac] updated user_prompting
 2 files changed, 22 insertions(+), 1046 deletions(-)
 delete mode 100644 .ipynb_checkpoints/Main-checkpoint.ipynb
 rename Main.ipynb => user_prompting.ipynb (99%)
Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 8 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 21.09 KiB | 10.55 MiB/s, done.
Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/michaelmurrayiv/Transformer-Chat-Bot.git
   6e3ba08..d62e8ac  main -> main
