# MLOps - Final Project
#### Luke Schwenke & Aaron Chan
#### December 4, 2023

### LLM - HuggingFace

Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

Model Chosen = microsoft/phi-1_5

Link to Model: https://huggingface.co/microsoft/phi-1_5

The language model phi-1.5 is a Transformer with 1.3 billion parameters. It was trained using the same data sources as phi-1, augmented with a new data source that consists of various NLP synthetic texts. When assessed against benchmarks testing common sense, language understanding, and logical reasoning, phi-1.5 demonstrates nearly state-of-the-art performance among models with fewer than 10 billion parameters.

In [2]:
#!pip install einops
#!pip install -U transformers
#!pip install protobuf

In [3]:
from transformers import pipeline
pipe = pipeline("text-generation", model="microsoft/phi-1_5", trust_remote_code=True)

In [4]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)

In [5]:
prompts = [
    "What types of exercise are best for people with asthma?",
    "How is obsessive-compulsive disorder diagnosed?",
    "When are you more likely to get a blood clot?",
    "How should you lift objects to prevent back pain?",
    "How can you be smart with antibiotics?"
]

# 1: Zero-Shot

Large language models such as GPT-3 are trained on large amounts of data and tuned to follow instructions, so they can perform some tasks "zero-shot", that is, without any examples or additional context provided before the question is asked.

In [6]:
# Create a text generation pipeline with the downloaded HuggingFace model and tokenizer
text_generation_pipe = pipeline('text-generation', model=model, tokenizer=tokenizer)

In [7]:
# Generate text responses based on the 5 prompts we defined
generated_texts=[]
for prompt in prompts:
    generated_text = text_generation_pipe(prompt, max_length=100)[0]['generated_text']
    generated_texts.append(generated_text)

In [8]:
# Print the generated text
print("Zero Shot Generated Texts: \n")
for i in generated_texts:
    print("****************************************************************************************")
    print(i)

Zero Shot Generated Texts: 

****************************************************************************************
What types of exercise are best for people with asthma?
Answer: Aerobic exercise, such as walking, swimming, or cycling, is best for people with asthma.

5. What is the difference between aerobic and anaerobic exercise?
Answer: Aerobic exercise uses oxygen to produce energy, while anaerobic exercise does not.



Question 1: A store sells apples for $0.50 each and oranges for $0.75 each. If a customer buys
****************************************************************************************
How is obsessive-compulsive disorder diagnosed?
Answer: Obsessive-compulsive disorder is diagnosed through a combination of interviews, psychological tests, and observations.

Exercise 3: What are some common treatments for obsessive-compulsive disorder?
Answer: Common treatments for obsessive-compulsive disorder include therapy, medication, and lifestyle changes.

Exercise 4: How 
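The generations above answer the prompt and then keep going with unrelated follow-up questions until the token limit is reached. One way to handle this is to trim each generation after its first `Answer:` line. The helper below is a minimal sketch of that idea; `keep_first_answer` is a hypothetical name that is not part of the notebook, and it assumes the answer fits on a single line, as it does in the outputs shown.

```python
# Hypothetical post-processing helper (not part of the original notebook):
# keep the prompt and the first "Answer: ..." line, drop everything after it.
def keep_first_answer(generated_text: str) -> str:
    kept = []
    for line in generated_text.splitlines():
        kept.append(line)
        if line.strip().startswith("Answer:"):
            break  # stop at the first answer line
    return "\n".join(kept).strip()


# Sample text copied from the zero-shot output above
sample = (
    "What types of exercise are best for people with asthma?\n"
    "Answer: Aerobic exercise, such as walking, swimming, or cycling, "
    "is best for people with asthma.\n"
    "\n"
    "5. What is the difference between aerobic and anaerobic exercise?\n"
    "Answer: Aerobic exercise uses oxygen to produce energy, "
    "while anaerobic exercise does not.\n"
)

trimmed = keep_first_answer(sample)
print(trimmed)
```

If no `Answer:` line is present, the helper returns the text unchanged, so it is safe to apply to every generation.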

In [9]:
# zero_shot_pipe = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)

# for prompt in prompts:
#     result = zero_shot_pipe(prompt, candidate_labels=["physical", "mental"])
#     print(f"Prompt: {prompt}\nPredicted Label: {result['labels'][result['scores'].index(max(result['scores']))]}\n")
#     print(result['labels'], "\n")
#     print(result['scores'], "\n")
#     print("****************************************************************** \n")

# 2: Few-Shot

While large language models demonstrate remarkable zero-shot capabilities, they still fall short on more complex tasks in the zero-shot setting. Few-shot prompting enables in-context learning: we provide demonstrations in the prompt to steer the model toward better performance. The demonstrations serve as conditioning for subsequent examples where we would like the model to generate a response.
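As a minimal sketch of how demonstrations can be placed in front of a question, the helper below joins question/answer pairs into a single prompt. The function name `build_few_shot_prompt` and the `Question:`/`Answer:` layout are assumptions chosen for illustration; the demonstration text is taken from prompts used later in this notebook.

```python
def build_few_shot_prompt(demonstrations, question):
    """Join (question, answer) demonstrations and the new question into one prompt."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in demonstrations]
    # Leave the final answer open so the model completes it
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


demos = [
    (
        "How should heavy objects be lifted?",
        "Heavy objects should be lifted with your legs primarily.",
    ),
]
target = "How should you lift objects to prevent back pain?"

few_shot_prompt = build_few_shot_prompt(demos, target)
zero_shot_prompt = build_few_shot_prompt([], target)  # no demonstrations
print(few_shot_prompt)
```

With an empty demonstration list the same helper produces a zero-shot prompt, which makes it easy to compare the two settings on identical questions.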

In [10]:
#mental_prompts = [prompts[1], prompts[3]]
#physical_prompts = [prompts[0], prompts[2], prompts[4]]

In [11]:
# few_shot_pipe = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)

# for prompt in few_shot_prompts:
#     result = few_shot_pipe(prompt, candidate_labels=["physical", "mental"])
#     print(f"Prompt: {prompt}\nPredicted Label: {result['labels'][result['scores'].index(max(result['scores']))]}\n")
#     print(result['labels'], "\n")
#     print(result['scores'], "\n")
#     print("****************************************************************** \n")

In [12]:
few_shot_prompts = [
    "Swimming and walking can be good for people who have asthma. What types of exercise are best for people with asthma?",
    "OCD can be diagnosed by obsessions. How is obsessive-compulsive disorder diagnosed?",
    "Blood clots are more likely in people with high blood pressure. When are you more likely to get a blood clot?",
    "Lift objects with your legs to prevent back pain. How should you lift objects to prevent back pain?",
    "Antibiotics should only be taken in specific doses when you need them most. How can you be smart with antibiotics?",
]

In [13]:
generated_texts=[]
for prompt in few_shot_prompts:
    generated_text = text_generation_pipe("These are physical and mental health related questions. "+prompt, max_length=100)[0]['generated_text']
    generated_texts.append(generated_text)

print("Few-Shot Generated Texts: \n")
for i in generated_texts:
    print("****************************************************************************************")
    print(i)

Few-Shot Generated Texts: 

****************************************************************************************
These are physical and mental health related questions. Swimming and walking can be good for people who have asthma. What types of exercise are best for people with asthma?

Answer: Swimming and walking are good for people with asthma.

Exercise 3:

What is the difference between swimming and walking?

Answer: Swimming is a low-impact exercise that is good for people with joint problems, while walking is a low-impact exercise that is good for people with
****************************************************************************************
These are physical and mental health related questions. OCD can be diagnosed by obsessions. How is obsessive-compulsive disorder diagnosed? By obsessions.

(1). How do you treat obsessive-compulsive disorder? By therapy and medication.
(2). How do you cope with obsessive-compulsive disorder? By relaxation techniques and exposure ther

# 3: Chain-of-Thought

In [14]:
chain_of_thought_prompts = [

    """Question: What is a good exercise for people with breathing problems?
       Answer: Light exercise like walking can be good for people with these issues.
       Question: What types of exercise are best for people with asthma?""",

    """Question: How can mental disorders be diagnosed?
       Answer: Generally these disorders are diagnosed through various tests and interviews with a psychiatrist.
       Question: How is an obsessive-compulsive disorder diagnosed?""",

    """Question: When do blood problems usually show up? 
       Answer: Blood problems are often the result of high blood pressure in people over 30 years old.
       Question: When are you more likely to get a blood clot?""",

    """Question: How should heavy objects be lifted?
       Answer: Heavy objects should be lifted with your legs primarily.
       Question: How should you lift objects to prevent back pain?""",

    """Question: When should you take medication?
       Answer: Only take medication when necessary and when prescribed by your doctor.
       Question: How can you be smart with antibiotics?""",
]
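Because the prompts above are triple-quoted strings written inside an indented list, the leading spaces on the second and third lines become part of the prompt, and the model echoes that indentation in its output. A small normalization step, sketched below under the assumption that indentation carries no meaning in these prompts, removes it. `strip_indentation` is a hypothetical helper name.

```python
def strip_indentation(prompt: str) -> str:
    """Remove leading and trailing whitespace from every line of a prompt."""
    return "\n".join(line.strip() for line in prompt.splitlines())


# One of the prompts from the list above, written the same way
raw_prompt = """Question: How should heavy objects be lifted?
       Answer: Heavy objects should be lifted with your legs primarily.
       Question: How should you lift objects to prevent back pain?"""

clean_prompt = strip_indentation(raw_prompt)
print(clean_prompt)
```

Applying it to the whole list would look like `[strip_indentation(p) for p in chain_of_thought_prompts]`.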

In [15]:
generated_texts=[]
for prompt in chain_of_thought_prompts:
    generated_text = text_generation_pipe(prompt, max_length=100)[0]['generated_text']
    generated_texts.append(generated_text)

# Print the generated text
print("Chain-of-Thought Generated Texts: \n")
for i in generated_texts:
    print("****************************************************************************************")
    print(i)

Chain-of-Thought Generated Texts: 

****************************************************************************************
Question: What is a good exercise for people with breathing problems?
       Answer: Light exercise like walking can be good for people with these issues.
       Question: What types of exercise are best for people with asthma?
       Answer: Aerobic exercise like walking, swimming, or cycling can be good for people with asthma.
       Question: What is the best way to prevent asthma attacks?
       Answer: Avoiding triggers like smoke, dust, and pollen can help prevent asthma attacks
****************************************************************************************
Question: How can mental disorders be diagnosed?
       Answer: Generally these disorders are diagnosed through various tests and interviews with a psychiatrist.
       Question: How is an obsessive-compulsive disorder diagnosed?
       Answer: An obsessive-compulsive disorder is diagnosed throu

# 4: Retrieval Augmented Generation (RAG)

[Content Dataset](https://huggingface.co/datasets/databricks/databricks-dolly-15k)

[Code Adapted From](https://medium.com/international-school-of-ai-data-science/implementing-rag-with-langchain-and-hugging-face-28e3ea66c5f7)

In [16]:
#!pip install langchain
#!pip install datasets
#!pip install sentence-transformers
#!pip install faiss-cpu

In [17]:
from langchain.document_loaders import HuggingFaceDatasetLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain import HuggingFacePipeline
from langchain.chains import RetrievalQA

In [18]:
# Using an open source Databricks dataset
dataset_name = "databricks/databricks-dolly-15k"
page_content_column = "context"  # specify the column in the dataset we're interested in

# Create a loader instance
loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)

# Load the data
data = loader.load()

# Display some entries
data[:2]



[Document(page_content='"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia\'s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."', metadata={'instruction': 'When did Virgin Australia start operating?', 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'}),
 Document(page_content='""', metadata={'instruction': 'Which is a species of fish? Tope or Rope', 'response': 'Tope', 'category': 'classification'})]

In [19]:
# Create an instance of the RecursiveCharacterTextSplitter class with specific parameters.
# It splits text into chunks of at most 1000 characters with a 150-character overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

# 'data' holds the loaded documents; split them into smaller chunks using the text splitter.
docs = text_splitter.split_documents(data)
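To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that class also tries to break on separators such as paragraphs and sentences); `chunk_with_overlap` is a hypothetical function that only illustrates how consecutive chunks share characters.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


alphabet = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(alphabet, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, which is the same role the 150-character overlap plays above: context that straddles a boundary appears in both neighbouring chunks.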

In [20]:
# Define the path to the pre-trained text encoder model you want to use
modelPath = "sentence-transformers/all-MiniLM-l6-v2"

# Create a dictionary with model configuration options, specifying to use the CPU for computations
model_kwargs = {'device':'cpu'}

# Create a dictionary with encoding options, specifically setting 'normalize_embeddings' to False
encode_kwargs = {'normalize_embeddings': False}

# Initialize an instance of HuggingFaceEmbeddings with the specified parameters
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,     # Provide the pre-trained model's path
    model_kwargs=model_kwargs, # Pass the model configuration options
    encode_kwargs=encode_kwargs # Pass the encoding options
)

In [21]:
# Create a FAISS vector database
db = FAISS.from_documents(docs, embeddings)
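Conceptually, the vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings are closest to the query embedding. The sketch below shows that idea with a brute-force search over made-up 3-dimensional vectors. It assumes Euclidean (L2) distance, which is believed to be the default for this FAISS wrapper but should be checked against the installed LangChain version; the vectors and labels are toy values, not real embeddings.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(query_vec, indexed, k=1):
    """Return the k texts whose vectors are closest to query_vec."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy (text, embedding) pairs; real embeddings from all-MiniLM-L6-v2 have 384 dimensions
toy_index = [
    ("cheese making", [0.9, 0.1, 0.0]),
    ("airline history", [0.0, 0.2, 0.9]),
    ("asthma symptoms", [0.1, 0.9, 0.1]),
]
toy_query = [0.8, 0.2, 0.0]

top = nearest_chunks(toy_query, toy_index, k=2)
print(top)
```

A real index avoids sorting every vector for each query, but the ranking it returns follows the same principle.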

In [22]:
# Perform a similarity search on the FAISS database
question = "What is cheesemaking?"
searchDocs = db.similarity_search(question)
print(searchDocs[0].page_content)

"The goal of cheese making is to control the spoiling of milk into cheese. The milk is traditionally from a cow, goat, sheep or buffalo, although, in theory, cheese could be made from the milk of any mammal. Cow's milk is most commonly used worldwide. The cheesemaker's goal is a consistent product with specific characteristics (appearance, aroma, taste, texture). The process used to make a Camembert will be similar to, but not quite the same as, that used to make Cheddar.\n\nSome cheeses may be deliberately left to ferment from naturally airborne spores and bacteria; this approach generally leads to a less consistent product but one that is valuable in a niche market.\n\nCulturing\nCheese is made by bringing milk (possibly pasteurised) in the cheese vat to a temperature required to promote the growth of the bacteria that feed on lactose and thus ferment the lactose into lactic acid. These bacteria in the milk may be wild, as is the case with unpasteurised milk, added from a culture,


### Prepare the LLM

In [23]:
model_name = "microsoft/phi-1_5"

In [24]:
# Load the tokenizer associated with the specified model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Define a text-generation pipeline using the model and tokenizer
test_generation = pipeline(
    "text-generation", 
    model=model_name, 
    tokenizer=tokenizer,
    trust_remote_code=True,
    max_new_tokens=100
)

# Create an instance of HuggingFacePipeline, which wraps the text-generation pipeline
# so that LangChain can call it as an LLM
llm = HuggingFacePipeline(
    pipeline=test_generation
)

In [25]:
# Create a retriever object from 'db' configured to return the single most relevant chunk (k=1).
retriever = db.as_retriever(search_kwargs={"k": 1})

# Create a question-answering instance (qa) using the RetrievalQA class.
# It's configured with a language model (llm), the chain type "refine", the retriever we created, and the option to return source documents.
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="refine", retriever=retriever, return_source_documents=True)
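The general idea behind a "refine" style chain is to draft an answer from the first retrieved document and then revise that draft once for each additional document. The sketch below illustrates that loop with a stand-in LLM. The prompt wording, `refine_answer`, and `fake_llm` are all hypothetical and are not LangChain's actual templates or API.

```python
from typing import Callable, List


def refine_answer(question: str, documents: List[str], llm: Callable[[str], str]) -> str:
    """Draft an answer from the first document, then revise it once per extra document."""
    if not documents:
        return llm(f"Question: {question}\nAnswer:")
    answer = llm(f"Context: {documents[0]}\nQuestion: {question}\nAnswer:")
    for doc in documents[1:]:
        answer = llm(
            f"Existing answer: {answer}\nNew context: {doc}\n"
            f"Refine the existing answer to the question: {question}"
        )
    return answer


calls = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: records the prompt and returns a numbered draft."""
    calls.append(prompt)
    return f"draft {len(calls)}"


single = refine_answer("What is cheesemaking?", ["doc A"], fake_llm)
multi = refine_answer("What is cheesemaking?", ["doc A", "doc B", "doc C"], fake_llm)
print(single, "|", multi, "|", len(calls), "LLM calls in total")
```

With `k=1`, as configured above, only one document is retrieved, so under this sketch the loop never runs and the model is called once per question.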

In [26]:
response = []
source_doc = []
for i in prompts:
    result = qa({"query": i})
    response.append(result['result'])
    source_doc.append(result['source_documents'])

In [27]:
# Print the generated text
print("RAG Generated Texts: \n")
for i in range(len(response)):
    print("****************************************************************************************")
    print(response[i])
    print(source_doc[i])

RAG Generated Texts: 

****************************************************************************************

Answer: Aerobic exercise is best for people with asthma.

Explanation: Aerobic exercise is any exercise that increases your heart rate and breathing rate. This type of exercise is good for people with asthma because it helps to strengthen the lungs and improve breathing.

Exercise: Give an example of an aerobic exercise.

Answer: Running, swimming, and cycling are all examples of aerobic exercise.

Explanation: Running, swimming, and cycling are all examples
[Document(page_content='"An acute asthma exacerbation is commonly referred to as an asthma attack. The classic symptoms are shortness of breath, wheezing, and chest tightness. The wheezing is most often when breathing out. While these are the primary symptoms of asthma, some people present primarily with coughing, and in severe cases, air motion may be significantly impaired such that no wheezing is heard. In children, c

# Instruction Tuning with MASHQA Dataset

In [28]:
import pandas as pd

def mashqa_convert(path_to_file):
    mashqa = pd.read_json(path_to_file)
    data_tuples_three = []
    for entry in mashqa['data']:
        for paragraph in entry['paragraphs']:
            for qa in paragraph['qas']:
                question = ("Question:", qa['question'])
                context = ("Context:",' '.join(paragraph['sent_list']))
                
                # Assuming there can be multiple answers, combining them into a single string
                answers = ("Answer:",' '.join([answer['text'] for answer in qa['answers']]))
                
                data_tuples_three.append((question, context, answers))

    df_mashqa_cleaned = pd.DataFrame(data_tuples_three, columns=['Question', 'Context', 'Answer'])
    df_mashqa_cleaned['Question_Answer'] = df_mashqa_cleaned['Question'].apply(lambda x: x[1]) + ' Answer: ' + df_mashqa_cleaned['Answer'].apply(lambda x: x[1])
    df_ready_for_tuning = df_mashqa_cleaned[['Question_Answer']]

    df_ready_for_tuning = df_ready_for_tuning.drop_duplicates()
    # Rename the 'Question_Answer' column to 'text', the column name the tokenization step reads
    df_ready_for_tuning.rename(columns={'Question_Answer': 'text'}, inplace=True)
    return df_ready_for_tuning
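To show what `mashqa_convert` produces, the sketch below applies the same flattening to a tiny in-memory record using only the standard library. The record is made up for illustration; only the field names (`data`, `paragraphs`, `qas`, `question`, `answers`, `text`, `sent_list`) are taken from the function above. `flatten_to_text` is a hypothetical stand-in, not a replacement for the pandas version.

```python
# Made-up record that mimics the nesting read by mashqa_convert
toy_mashqa = {
    "data": [
        {
            "paragraphs": [
                {
                    "sent_list": ["Sentence one.", "Sentence two."],
                    "qas": [
                        {
                            "question": "Toy question?",
                            "answers": [{"text": "First part."}, {"text": "Second part."}],
                        },
                        {
                            # Exact duplicate, to show de-duplication
                            "question": "Toy question?",
                            "answers": [{"text": "First part."}, {"text": "Second part."}],
                        },
                    ],
                }
            ]
        }
    ]
}


def flatten_to_text(dataset):
    """Build one '<question> Answer: <joined answers>' string per QA pair, without duplicates."""
    rows = []
    for entry in dataset["data"]:
        for paragraph in entry["paragraphs"]:
            for qa in paragraph["qas"]:
                answer = " ".join(a["text"] for a in qa["answers"])
                rows.append(f"{qa['question']} Answer: {answer}")
    # dict.fromkeys keeps the first occurrence of each row and preserves order
    return list(dict.fromkeys(rows))


texts = flatten_to_text(toy_mashqa)
print(texts)
```

Note that, like the function above, this keeps only the question and answer; the context sentences are read but do not end up in the training text.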


In [29]:
df_train = mashqa_convert('/Users/aaron/Downloads/mashqa_data/train_webmd_squad_v2_full.json')
df_val = mashqa_convert('/Users/aaron/Downloads/mashqa_data/val_webmd_squad_v2_full.json')

### Instruction Tuning Start

In [30]:
model

PhiForCausalLM(
  (transformer): PhiModel(
    (embd): Embedding(
      (wte): Embedding(51200, 2048)
      (drop): Dropout(p=0.0, inplace=False)
    )
    (h): ModuleList(
      (0-23): 24 x ParallelBlock(
        (ln): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.0, inplace=False)
        (mixer): MHA(
          (rotary_emb): RotaryEmbedding()
          (Wqkv): Linear(in_features=2048, out_features=6144, bias=True)
          (out_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (inner_attn): SelfAttention(
            (drop): Dropout(p=0.0, inplace=False)
          )
          (inner_cross_attn): CrossAttention(
            (drop): Dropout(p=0.0, inplace=False)
          )
        )
        (mlp): MLP(
          (fc1): Linear(in_features=2048, out_features=8192, bias=True)
          (fc2): Linear(in_features=8192, out_features=2048, bias=True)
          (act): NewGELUActivation()
        )
      )
    )
  )
  (lm_h

In [31]:
tokenizer

CodeGenTokenizerFast(name_or_path='microsoft/phi-1_5', vocab_size=50257, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	50257: AddedToken("                               ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
	50258: AddedToken("                              ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
	50259: AddedToken("                             ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
	50260: AddedToken("                            ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
	50261: AddedToken("          

##### Check that we are using a GPU. With less than 12 GB of video memory, the model likely cannot be trained.

In [32]:
import torch
torch.cuda.is_available()

True

In [33]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, device_map="auto")

tokenizer.padding_side = 'left'
tokenizer.pad_token = tokenizer.unk_token

In [34]:
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ["Wqkv", "out_proj"]
)

peft_model = get_peft_model(model, peft_config)
peft_model.gradient_checkpointing=True

training_arguments = TrainingArguments(
        output_dir="./results",
        evaluation_strategy="steps",
        save_strategy='epoch',
        do_eval=True,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        per_device_eval_batch_size=4,
        logging_steps=50,
        learning_rate=4e-4,
        eval_steps=100,
        num_train_epochs=1,
        warmup_steps=100,
        lr_scheduler_type="cosine",
        remove_unused_columns=True
)
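For a rough sense of how small the LoRA adapters are, the arithmetic below estimates the number of trainable parameters implied by the `LoraConfig` above. It is a back-of-the-envelope sketch that assumes standard LoRA (two low-rank matrices per targeted linear layer, no trainable biases because `bias="none"`) and one `Wqkv` and one `out_proj` layer in each of the 24 blocks, with the shapes shown in the model printout. Calling `peft_model.print_trainable_parameters()` would give the authoritative figure.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters in one LoRA adapter: A is (r x in_features), B is (out_features x r)."""
    return r * (in_features + out_features)


R, ALPHA, NUM_LAYERS = 16, 16, 24  # r and lora_alpha from LoraConfig; 24 blocks from the printout

wqkv = lora_param_count(2048, 6144, R)      # Wqkv: 2048 -> 6144
out_proj = lora_param_count(2048, 2048, R)  # out_proj: 2048 -> 2048
total_trainable = NUM_LAYERS * (wqkv + out_proj)
scaling = ALPHA / R  # factor applied to the low-rank update

print(f"per block: {wqkv + out_proj:,}")
print(f"total trainable: {total_trainable:,}")
print(f"scaling: {scaling}")
```

Under these assumptions the adapters add about 4.7 million trainable parameters, a small fraction of the 1.3 billion parameters in the base model.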


#### Before tokenizing the reformatted MASHQA data, convert the pandas DataFrames into Hugging Face `Dataset` objects

In [35]:
from datasets import Dataset

dataset_train = Dataset.from_pandas(df_train)
dataset_val = Dataset.from_pandas(df_val)
print(dataset_train)
print(dataset_val)

Dataset({
    features: ['text', '__index_level_0__'],
    num_rows: 27722
})
Dataset({
    features: ['text'],
    num_rows: 3587
})


In [41]:
def tok(sample):
    model_inps =  tokenizer(sample["text"], padding=True, max_length=500, truncation=True)
    return model_inps

tokenized_training_data = dataset_train.select(range(2000)).map(tok, batched=True)
tokenized_val_data = dataset_val.select(range(250)).map(tok, batched=True)

# tokenized_training_data = dataset_train.map(tok, batched=True)
# tokenized_val_data = dataset_val.map(tok, batched=True)



Map: 100%|██████████| 2000/2000 [00:00<00:00, 4474.28 examples/s]
Map: 100%|██████████| 250/250 [00:00<00:00, 4310.27 examples/s]
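The `tok` function relies on three tokenizer options: `truncation=True` with `max_length=500` cuts long sequences, and `padding=True` pads shorter sequences to the longest one in the batch. Because `tokenizer.padding_side` was set to `'left'` earlier, padding goes in front of the tokens. The sketch below imitates that behaviour on plain lists of integers. `pad_and_truncate` is a hypothetical helper, and the token ids and pad id are toy values.

```python
def pad_and_truncate(batch, max_length, pad_id, padding_side="left"):
    """Truncate each sequence to max_length, then pad all to the longest remaining length."""
    truncated = [ids[:max_length] for ids in batch]  # drop tokens from the right
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        if padding_side == "left":
            input_ids.append([pad_id] * n_pad + ids)
            attention_mask.append([0] * n_pad + [1] * len(ids))
        else:
            input_ids.append(ids + [pad_id] * n_pad)
            attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[11, 12, 13, 14, 15, 16], [21, 22], [31, 32, 33]]
encoded = pad_and_truncate(toy_batch, max_length=4, pad_id=0)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask marks padded positions with 0 so the model can ignore them.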


In [42]:
trainer = Trainer(
    model=peft_model,
    train_dataset=tokenized_training_data,
    eval_dataset=tokenized_val_data,
    args=training_arguments,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    
)
trainer.train()

                                                      
  5%|▍         | 42/866 [3:13:45<30:07:14, 131.60s/it]

{'loss': 2.1668, 'learning_rate': 0.0002, 'epoch': 0.8}


                                                      
100%|██████████| 62/62 [2:07:32<00:00, 123.43s/it]

{'train_runtime': 7652.8996, 'train_samples_per_second': 0.261, 'train_steps_per_second': 0.008, 'train_loss': 2.1501388857441563, 'epoch': 0.99}





TrainOutput(global_step=62, training_loss=2.1501388857441563, metrics={'train_runtime': 7652.8996, 'train_samples_per_second': 0.261, 'train_steps_per_second': 0.008, 'train_loss': 2.1501388857441563, 'epoch': 0.99})
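The step counts in the progress bars follow from the batch settings in `TrainingArguments`. The sketch below reproduces that arithmetic. It assumes a single GPU and that optimizer steps per epoch are the number of batches integer-divided by `gradient_accumulation_steps`, which is consistent with the counts shown above but may differ in other `transformers` versions. `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_rows, per_device_batch_size, grad_accum_steps, epochs=1):
    """Estimate optimizer steps for a single-device run."""
    batches_per_epoch = math.ceil(num_rows / per_device_batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * epochs


effective_batch = 4 * 8  # per_device_train_batch_size * gradient_accumulation_steps

subset_steps = optimizer_steps(2000, 4, 8)   # the 2000-row subset used for training
full_steps = optimizer_steps(27722, 4, 8)    # all rows in dataset_train

print(effective_batch, subset_steps, full_steps)
```

The 2000-row subset gives 62 steps, matching `global_step=62`, while the full 27,722-row training set would need 866 steps. At roughly two minutes per step, that difference is the practical reason for training on a subset.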

In [52]:
peft_model.save_pretrained('lora', save_adapter=True, save_config=True)

In [56]:
model_to_merge = PeftModel.from_pretrained(AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True), 'lora')

In [57]:
merged_model = model_to_merge.merge_and_unload()
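Merging folds the low-rank update into the base weights so that inference no longer needs the adapter. For a single linear layer the standard LoRA update is W_merged = W + (alpha / r) * B A. The sketch below works through that formula on tiny hand-written matrices in plain Python. It illustrates the math only and is not how PEFT implements `merge_and_unload`; the matrices are toy values.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return weight + (alpha / r) * (lora_b @ lora_a)."""
    scaling = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, shape (out=2, in=2)
A = [[1.0, 2.0]]              # LoRA A, shape (r=1, in=2)
B = [[0.5], [0.0]]            # LoRA B, shape (out=2, r=1)

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```

After merging, the layer has the same shape as the original, which is why `merged_model` can be passed to `pipeline` like any regular model.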

In [58]:
text_generation_pipe = pipeline('text-generation', model=merged_model, tokenizer=tokenizer)

In [59]:
prompts = [
    "What types of exercise are best for people with asthma?",
    "How is obsessive-compulsive disorder diagnosed?",
    "When are you more likely to get a blood clot?",
    "How should you lift objects to prevent back pain?",
    "How can you be smart with antibiotics?"
]

In [60]:
# Generate text responses based on the 5 prompts we defined
generated_texts=[]
for prompt in prompts:
    generated_text = text_generation_pipe(prompt, max_length=100)[0]['generated_text']
    generated_texts.append(generated_text)

In [61]:
# Print the generated text
print("Instruction Tuned Generated Texts: \n")
for i in generated_texts:
    print("****************************************************************************************")
    print(i)

Instruction Tuned Generated Texts: 

****************************************************************************************
What types of exercise are best for people with asthma? Answer: Aerobic exercise: Aerobic exercise is a type of exercise that gets your heart rate up and makes you breathe harder. It's good for people with asthma because it helps your lungs get stronger. Aerobic exercise can also help you lose weight, which can help you breathe easier. Aerobic exercise can include: Running Walking Swimming Cycling Aerobics Dancing Aerobics Yoga Aerobics Aerobics Aerobics Aerobics
****************************************************************************************
How is obsessive-compulsive disorder diagnosed? Answer: Your doctor will ask you about your symptoms and your family history. You may also be asked to fill out a questionnaire. Your doctor may also order a blood test to check for a chemical imbalance in your brain. Your doctor may also order a brain scan, such as an