### Implementing a Query Re-writing Strategy (Sub-query Decomposition) + Retrieval Methodology (Hypothetical Document Embeddings, HyDE)
In this section, we aim to **improve the document retrieval capabilities** of our RAG system (Research Assistant v2) through (1) sub-query decomposition and (2) the HyDE methodology, both described below. Future work would include RAFT (retrieval-augmented fine-tuning) to improve the answer generation phase of our answer generator LLM.

**Query re-writing and HyDE**
In the context of query re-writing to optimise retrieval and answer quality in RAG systems, Hypothetical Document Embeddings (HyDE) and Query Decomposition are techniques that tackle different problems, each with the help of an external LLM. HyDE aims to improve retrieval “accuracy” by having the LLM generate a “hypothetical document” that answers the user query and then embedding it. The embedded hypothetical document is used, instead of the raw query, to retrieve real documents from the vector store via doc-doc similarity search, which tends to produce more “relevant” matches.
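
The HyDE flow described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the implementation used in this notebook: `embed` is a toy bag-of-words stand-in for a real embedding model, `draft_hypothetical_document` is a hypothetical placeholder for the LLM call, and the three-document corpus is made up for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def draft_hypothetical_document(query: str) -> str:
    """Hypothetical placeholder for an LLM call; returns a canned passage."""
    return (
        "canary deployment releases a new model version to a small "
        "subset of users before a full rollout"
    )


def hyde_retrieve(query: str, corpus: list, top_k: int = 1) -> list:
    # 1) draft a hypothetical answer, 2) embed it, 3) rank real docs against it
    hypothetical_vec = embed(draft_hypothetical_document(query))
    ranked = sorted(
        corpus,
        key=lambda doc: cosine(hypothetical_vec, embed(doc)),
        reverse=True,
    )
    return ranked[:top_k]


corpus = [
    "the team labels images with amazon rekognition",
    "a new model version is released to a small subset of users via canary deployment",
    "resnet50 and vit-16 are compared for image classification",
]
query = "How are updates shipped safely?"
print(hyde_retrieve(query, corpus))
```

In this toy setup the raw query shares no vocabulary with the relevant document, while the hypothetical passage does, which is the intuition behind doc-doc similarity search.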

Query Decomposition aims to improve answer quality for complex queries. It breaks a complex query down into smaller sub-questions, which can be solved either sequentially (the answer to the first sub-question, together with fresh retrieval, is used to answer the second) or in parallel (each sub-question is answered independently and the answers are then consolidated into a final answer).
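
The difference between the two modes can be sketched as follows. This is an illustrative sketch only: `toy_answer` is a hypothetical stand-in for "retrieve + generate" on a single sub-question, and the string join stands in for an LLM synthesis step.

```python
from typing import Callable, List


def toy_answer(question: str, prior: str = "") -> str:
    """Hypothetical stand-in for 'retrieve + generate' on one sub-question."""
    return f"A({question})" if not prior else f"A({question} | given {prior})"


def solve_sequentially(sub_questions: List[str], answer: Callable[..., str]) -> str:
    """Each answer is fed forward as extra context for the next sub-question."""
    prior = ""
    for question in sub_questions:
        prior = answer(question, prior)
    return prior


def solve_in_parallel(sub_questions: List[str], answer: Callable[..., str]) -> str:
    """Sub-questions are answered independently, then consolidated."""
    partial_answers = [answer(question) for question in sub_questions]
    # A real system would ask an LLM to synthesize; here we simply join.
    return " + ".join(partial_answers)


subs = ["q1", "q2"]
print(solve_sequentially(subs, toy_answer))
print(solve_in_parallel(subs, toy_answer))
```

Sequential solving suits questions where a later sub-question depends on an earlier answer; parallel solving suits independent sub-questions and is easier to run concurrently.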

Other query transformation techniques that address “human-centric” limitations of user queries include multi-query (for ambiguous queries) and step-back prompting (when a higher-level conceptual understanding is required for accurate retrieval).
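
As a rough illustration of the multi-query idea only (step-back is not shown), the sketch below retrieves for several rewrites of one query and merges the results. Everything here is an assumption for the example: `rewrite_variants` is a hypothetical stand-in for an LLM paraphraser, and `toy_retrieve` is a dictionary lookup rather than a vector search.

```python
from typing import Dict, List


def rewrite_variants(query: str) -> List[str]:
    """Hypothetical stand-in for an LLM that paraphrases an ambiguous query."""
    return [query, "canary deployment strategy", "rollback procedure"]


def toy_retrieve(query: str, index: Dict[str, List[str]]) -> List[str]:
    """Toy retriever: exact lookup in a dict instead of a vector search."""
    return index.get(query, [])


def multi_query_retrieve(query: str, index: Dict[str, List[str]]) -> List[str]:
    """Retrieve for every variant and merge, dropping duplicates in order."""
    merged: List[str] = []
    for variant in rewrite_variants(query):
        for doc in toy_retrieve(variant, index):
            if doc not in merged:
                merged.append(doc)
    return merged


index = {
    "canary deployment strategy": ["doc_canary", "doc_cloudwatch"],
    "rollback procedure": ["doc_cloudwatch", "doc_rollback"],
}
print(multi_query_retrieve("how do they deploy?", index))
```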

In [1]:
# set autoreload for modules
%load_ext autoreload
%autoreload 2

# import dependencies
import os
import openai
from dotenv import load_dotenv, find_dotenv
import warnings
import nest_asyncio

_ = load_dotenv(find_dotenv())
warnings.filterwarnings("ignore")
nest_asyncio.apply()

**First, let's import and format our evaluation dataset**

In [3]:
import json
import pandas as pd
from llama_index.core.llama_dataset import (
    LabelledRagDataset,
    LabelledRagDataExample,
    CreatedBy,
)


def get_rag_dataset_from_csv(csv_path: str):
    converters = {
        "reference_contexts":    lambda s: json.loads(s),
        "query_by":             lambda s: CreatedBy.model_validate_json(s),
        "reference_answer_by":  lambda s: CreatedBy.model_validate_json(s),
    }
    df = pd.read_csv(csv_path, converters=converters)
    examples = []
    for _, row in df.iterrows():
        examples.append(
            LabelledRagDataExample(
                query=row["query"],
                query_by=row["query_by"],                      # now a CreatedBy
                reference_contexts=row["reference_contexts"],   # now a List[str]
                reference_answer=row["reference_answer"],
                reference_answer_by=row["reference_answer_by"], # now a CreatedBy
            )
        )
    # Create the dataset
    dataset = LabelledRagDataset(examples=examples)
    return dataset
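
The converters above are needed because every CSV cell is read back as a plain string, so list-valued or object-valued columns must be decoded again. A minimal standard-library sketch of that round trip, using a made-up row (the column names match the function above; the values are illustrative only):

```python
import csv
import io
import json

# Write one row whose list-valued column is JSON-encoded, as a CSV export would.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["query", "reference_contexts"])
writer.writeheader()
writer.writerow(
    {
        "query": "What is a canary deployment?",  # illustrative value
        "reference_contexts": json.dumps(["context one", "context two"]),
    }
)

# Reading it back yields a plain string until it is decoded again.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["reference_contexts"]
decoded = json.loads(raw)
print(type(raw).__name__, type(decoded).__name__, decoded)
```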

In [3]:
eval_dataset = get_rag_dataset_from_csv("data/eval_dataset.csv")
len(eval_dataset.examples)

55

**Next, we re-build the query engine (our RAG System)**

In [4]:
from llama_index.core import (
    Settings,
    VectorStoreIndex,
    SimpleDirectoryReader,
)
from llama_index.llms.openai import OpenAI

# Configure LLM
Settings.llm = OpenAI(model="gpt-4o-mini")

# Use a custom embedding model: hkunlp/instructor-large
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Load the embedding model from https://huggingface.co/hkunlp/instructor-large
Settings.embed_model = HuggingFaceEmbedding(model_name="hkunlp/instructor-large")

In [5]:
from llama_index.llms.ollama import Ollama

# Instantiate the query engine LLM; request_timeout is set to 3000s to allow sufficient time for answer generation
llm = Ollama(model="llama3.2:1b", request_timeout=3000)

In [6]:
# Build the index from the documents with the embedding model, then create the query engine (RAG system) with the local LLM
docs = SimpleDirectoryReader("../RAG-webscraper/docs/").load_data(show_progress=True)
index = VectorStoreIndex.from_documents(docs, embed_model=Settings.embed_model)
query_engine = index.as_query_engine(similarity_top_k=6, llm=llm)

Loading files: 100%|██████████| 1/1 [00:07<00:00,  7.69s/it]


#### 1. Building & Evaluating our Sub-Question Query Engine
**Let us first incorporate "sub-query decomposition" into our RAG system**

The sub-question query engine first breaks the complex query down into sub-questions, answers each sub-question by retrieving relevant documents from the data source, then gathers the intermediate responses and synthesizes a final response.
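
The decompose, answer, and synthesize steps can be sketched with `asyncio`. This is a hedged illustration of the general pattern, not LlamaIndex's actual implementation: `decompose` and `answer_sub_question` are hypothetical stand-ins for the LLM and retrieval calls, and the final join stands in for LLM-based synthesis.

```python
import asyncio
from typing import List


def decompose(query: str) -> List[str]:
    """Hypothetical stand-in for the LLM step that generates sub-questions."""
    return [part.strip() + "?" for part in query.split("?") if part.strip()]


async def answer_sub_question(sub_question: str) -> str:
    """Hypothetical stand-in for retrieval plus answer generation."""
    await asyncio.sleep(0)  # yields control, as a real network/LLM call would
    return f"answer to: {sub_question}"


async def sub_question_pipeline(query: str) -> str:
    sub_questions = decompose(query)
    # Answer all sub-questions concurrently; gather preserves input order.
    answers = await asyncio.gather(
        *(answer_sub_question(q) for q in sub_questions)
    )
    # A real engine would ask an LLM to synthesize; here we just join.
    return " | ".join(answers)


final = asyncio.run(sub_question_pipeline("How did A help B? How did A hide it?"))
print(final)
```

Inside a notebook with an already running event loop, `await sub_question_pipeline(...)` would be used instead of `asyncio.run(...)`, unless `nest_asyncio` has been applied as it is at the top of this notebook.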

In [8]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine

# setup base query engine as tool
query_engine_tools = [
    QueryEngineTool(
        query_engine=query_engine,
        metadata=ToolMetadata(
            name="docs",
            description="Project reports",
        ),
    ),
]

# wrap query engine tool in Sub-Question Query Engine object - Final Query Engine
query_engine_final = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    use_async=True,
)

In [9]:
# Test response
response = query_engine_final.query(
    "How did Snape support Harry despite being a deatheater? On top of that, how did he hide his allegiance with the order from Voldermort?"
)

Generated 5 sub questions.
[docs] Q: What specific actions did Snape take to support Harry throughout the series?
[docs] Q: How did Snape's role as a double agent influence his interactions with Harry?
[docs] Q: What information did Snape provide to the Order of the Phoenix regarding Voldemort's plans?
[docs] Q: What strategies did Snape use to conceal his true allegiance from Voldemort?
[docs] Q: How did Snape's past experiences shape his decisions to protect Harry?
[docs] A: Based on the provided context information, it can be inferred that Snape's role as a double agent influenced his interactions with Harry in several ways:

1. **Motivation**: As a double agent, Snape's primary motivation was to protect Harry from the Dark Lord Voldemort and ensure his own safety.
2. **Information sharing**: Although Snape did not reveal much about

In [10]:
print(response)

Snape supported Harry in several ways despite his affiliation with the Death Eaters. His actions included protecting Harry's life, which stemmed from his loyalty to Lily Potter, Harry's mother. Snape's commitment to keeping Harry safe was evident in his efforts to deflect the killing curse aimed at him and in his willingness to provide information that would aid Harry, such as suggesting strategies to mislead Voldemort.

To conceal his true allegiance from Voldemort, Snape employed various strategies. He pretended to propose plans, like using Polyjuice Potion to create decoys, which allowed him to gather intelligence on the Order of the Phoenix while maintaining the appearance of loyalty to Voldemort. Additionally, he manipulated situations to create a false sense of security around Harry, ensuring that Voldemort remained unaware of his true intentions. By acting in ways that aligned with Voldemort's expectations while secretly working to protect Harry, Snape successfully hid his alleg

**Food for thought:** *The sub-question query engine generated very detailed and granular responses compared with the ground-truth answers.*

*Response seems accurate!* **Let us now evaluate the revamped query engine (after implementing sub-question query decomposition)**

In [14]:
from llama_index.core.llama_pack import download_llama_pack

RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./pack")

# Instantiate RAG Evaluator - input query engine, evaluation dataset, judge LLM & embeddings model
rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine_final, 
    rag_dataset=eval_dataset,
    judge_llm=Settings.llm,  # judge with the same LLM that was used to create the dataset
    embed_model=Settings.embed_model
)


In [15]:
# Run in async mode
nest_asyncio.apply()

# run evaluator function
benchmark_df = await rag_evaluator.arun()

Batch processing of predictions:   0%|          | 0/10 [00:00<?, ?it/s]

Generated 2 sub questions.
[docs] Q: What specific method did the team use for data labeling in their explicit image classification project?
[docs] Q: What advantages were identified by the team regarding the data labeling method used in the explicit image classification project?
Generated 2 sub questions.
[docs] Q: What are the main components of the machine learning system architecture described in the document?
[docs] Q: What is the purpose of each component in the machine learning system architecture as described in the document?
Generated 3 sub questions.
[docs] Q: What are the details of Group 9 in the eyecatcher project report?
[docs] Q: Can you provide the list of members in Group 9 from the eyecatcher project report?
[docs] Q: What roles do the members of Group 9 play in the eyecatcher project report?


Batch processing of predictions:  10%|█         | 1/10 [09:17<1:23:37, 557.50s/it]

[docs] A: Amazon Rekognition uses an existing trained model for classification in its "DetectModerationLabels" method.
[docs] A: Implementing a machine learning tool for explicit image classification in social media platforms would likely have various cost implications. Here are some potential considerations:

1. **Data Collection and Labeling**: The project aims to use Amazon Rekognition as the data labeling solution, which may incur costs associated with accessing and utilizing Rekognition services. Additionally, collecting user feedback for concept drift detection and model drift monitoring could require additional personnel or resources.
2. **Model Training and Deployment**: The team will need to train a machine learning model on a large dataset of labeled images, which can be resource-intensive. Deploying the trained model in a scalable environment might involve additional costs for infrastructure, computing resources, and potential

Batch processing of predictions:  20%|██        | 2/10 [10:27<36:06, 270.86s/it]

[docs] A: The project proposes several strategies and methods to tackle the challenges mentioned in the provided context.

To achieve a balance between different aspects of performance, such as latency, prediction accuracy, and robustness, the project suggests using an incremental deployment strategy, where new model versions are released to a small subset of users or environment before a full rollout. This approach allows for monitoring performance, measuring prediction accuracy, and identifying unexpected behaviors or anomalies before they affect all users.

Regarding adversarial attacks, the project proposes implementing robust verification processes for user feedback and deploying "defences" against such attacks through using Amazon SageMaker's defences and Canaries deployment strategy.

For mitigating concerns around data collection and usage, including privacy and consent, the project suggests addressing these issues through stringent data handling and usag

Batch processing of predictions:  30%|███       | 3/10 [10:58<18:48, 161.21s/it]

[docs] A: The latest project report for Group 9 is available at page label 7. The report discusses the deployment strategy used in the project, including the Canary deployment approach and its benefits, as well as the use of CloudWatch alarms for managing rollback procedures during Canaries. Additionally, it touches upon the importance of monitoring and retraining the model to handle concept and model drift, highlighting the need for ongoing maintenance and adaptation to ensure the reliability and effectiveness of the machine learning system.
[docs] A: Based on the provided context information, there are no specific achievements or milestones reported for Group 9. The query mentions Group 9 as Christover Abraham Manafe, Loh Kwang Peng Micheal, Low Siang Leng Henry, Yee Jin Kett, and Aeyecatcher.PY, but does not mention any achievements or milestones related to this group.

Batch processing of predictions:  40%|████      | 4/10 [11:20<10:36, 106.09s/it]

[docs] A: Based on the provided context information, several common challenges can be identified in machine learning system architecture design:

1. **Balancing Model Performance and Resource Utilization**: As model size grows, balancing computational resources with performance is crucial to prevent overfitting or underperforming.
2. **Handling Adversarial Attacks**: Implementing robust verification processes for user feedback and deploying "defences" against adversarial attacks can help mitigate these risks.
3. **Managing Complexity and Scalability**: As the system scales up, managing complexity and ensuring scalability becomes increasingly challenging, particularly when dealing with large amounts of data and multiple stakeholders.
4. **Addressing Data Quality and Bias Issues**: Ensuring that data is accurate, complete, and unbiased is vital to maintaining model performance and fairness.
5. **Monitoring Performance and Identifying Issues Early**: Regular monitori

Batch processing of predictions:  50%|█████     | 5/10 [11:56<06:45, 81.13s/it]

[docs] A: Based on the provided context information, it appears that Amazon SageMaker Ground Truth is a model training and labeling service that uses human labelers to annotate data for training machine learning models. In contrast, Amazon Rekognition's "DetectModerationLabels" method utilizes computer vision technology to automatically detect and classify explicit content in images.

Using Amazon Rekognition's 'DetectModerationLabels' would likely have lower cost implications compared to relying on human labelers through Ground Truth, especially considering that the latter requires significant investment in hiring and training a team of data annotators.

Batch processing of predictions:  60%|██████    | 6/10 [12:16<04:00, 60.06s/it]

[docs] A: The text does not explicitly mention any measures taken to maintain user trust and safety in the deployment of this tool. However, based on the context, it appears that the team has implemented several steps to ensure the safe and responsible use of their explicit image classification system:

1. User feedback mechanisms are in place for handling concept and model drift, as mentioned in Section 3.3.
3. Data labelling is performed using Amazon Rekognition's "DetectModerationLabels" function, which generates labels based on the images being classified NSFW or Safe/NSFW.
4. The team has identified a data imbalance issue in their dataset and plans to address it through preprocessing steps.

Additionally, the text mentions that the team will implement explainability techniques using SageMaker Clarify's SHAP values to provide insights into the model's decision-making process. This can help users understand how the model arrives at its predictions.

It is also 

Batch processing of predictions:  70%|███████   | 7/10 [13:15<02:59, 59.80s/it]

[docs] A: Based on the provided context information, the following challenges have been identified in the development of the machine learning pipeline:

1. **Data Balance**: The model has found that it needs to adjust to being fine-tuned for better alignment with domain-specific data (common voice datasets), which improves its performance.
2. **Feature Alignment**: The team has observed that accent distributions across training and test sets are consistent, possibly explaining an improved fine-tuned performance. They also find "accent" distribution to be a key feature to speech variability.
3. **Fine-Tuning Limitations**: While model inference on the development set shows promising results, the distribution of WER metrics across our key feature "accent" is inconsistent, suggesting that this may not be enough to fine-tune the system's performance.
4. **Budgetary Constraints**: The social media platforms like TikTok and Instagram have invested heavily in machine lea

Batch processing of predictions:  80%|████████  | 8/10 [13:42<01:38, 49.44s/it]

[docs] A: The main components of the machine learning system architecture described in the document include:

1. Amazon SageMaker - a cloud-based platform for building, training, and deploying machine learning models
2. Amazon S3 Training Bucket and Interim Bucket Stores - storage solutions for training images and reported/appealed images for moderators to evaluate and take appropriate action
3. Amazon SageMaker - Model Registry - a centralized repository of trained models that can be easily accessed and managed
4. AWS CodeCommit Store, AWS CodeBuild, AWS CodePipeline, CloudWatch, Lambda, API Gateway, and IAM - various components for managing source code, building models, deploying to production, monitoring model performance, and interacting with users

These components work together to provide a scalable, reliable, and high-performance image classification system.

Batch processing of predictions:  90%|█████████ | 9/10 [14:12<00:43, 43.21s/it]

[docs] A: Based on the provided context information, no specific auto-scaling policy is mentioned for model deployment. The project report discusses various aspects of the image classification model, such as machine learning system architecture, deployment strategy, and monitoring and retraining step, but does not provide details on any recommended auto-scaling policies for model deployment.

The only mention of scaling is in the context of deploying a new version of the model to a small subset of users or environment before a full rollout using the Canary deployment strategy. However, this does not imply that an auto-scaling policy is implemented for the entire project.
[docs] A: The text does not explicitly mention any specific data preprocessing techniques used in the project. However, it mentions that in the data preprocessing stage, the team will be extracting up to 1000 images per class and adopting an 80/10/10 split of training, va

Batch processing of predictions: 100%|██████████| 10/10 [17:21<00:00, 104.17s/it]
Batch processing of predictions:   0%|          | 0/10 [00:00<?, ?it/s]

Generated 4 sub questions.
[docs] Q: What are the latest project reports related to AWS CodeBuild?
[docs] Q: What are the common issues faced during the compilation of source code in AWS CodeBuild?
[docs] Q: What best practices are recommended for building models using AWS CodeBuild?
[docs] Q: Are there any recent updates or changes in AWS CodeBuild that could affect the build process?
Generated 5 sub questions.
[docs] Q: What are the key features of AWS CodePipeline for CI/CD automation?
[docs] Q: How does AWS CodePipeline integrate with other AWS services?
[docs] Q: What are the best practices for setting up AWS CodePipeline?
[docs] Q: Can you provide examples of successful implementations of AWS CodePipeline?
[docs] Q: What are the common challenges faced

Batch processing of predictions:  10%|█         | 1/10 [10:20<1:33:04, 620.48s/it]

[docs] A: Based on the provided context information, it appears that you are discussing a project related to machine learning and cloud computing. However, I couldn't find any specific mention of "AWS CodeCommit" in your query.

However, based on the general use of CodeCommit as a Git repository for AWS services, here are some examples of successful projects:

1. **Amazon SageMaker**: SageMaker is an AI service offered by Amazon Web Services (AWS). One successful project using CodeCommit was the development and deployment of Amazon SageMaker models, including the creation of machine learning pipelines that integrated with other AWS services.
2. **Amazon Sumerian**: Sumerian is a cloud-based platform for building, publishing, and managing 3D content and experiences. A successful project using CodeCommit involved developing and deploying 3D models on Sumerian using Amazon SageMaker.
3. **AWS Greengrass**: Greengrass is an IoT (Internet of Things) service that allows

Batch processing of predictions:  20%|██        | 2/10 [12:04<42:14, 316.76s/it]

[docs] A: Based on the provided context information, the key features of Amazon SageMaker for model training include:

1. Training Bucket: Stores training images that will be converted into PyTorch Tensor for model training.
2. Interim Bucket: Stores reported/appealed images for moderators to evaluate and take appropriate action.
3. Model Registry: A catalogue of models to track and manage, containing a list of trained models.
4. Endpoint Deployments: Models are deployed as serverless computing services using AWS Lambda.
5. Inference Pipeline: Automates pipeline for CI/CD, allowing model inference in real-time.

Additionally, the query mentions SageMaker's features related to model training, including:

1. Model Building Workflow
2. Data Preprocessing (up to 1000 images per class)
3. Evaluation of trained models (requires a predefined level of accuracy before being added into the model registry)

These features provide a comprehensive overview of Amazon SageMaker'

Batch processing of predictions:  30%|███       | 3/10 [13:05<23:18, 199.77s/it]

[docs] A: Yes, there are reports on the effectiveness of using an interim bucket to store reported images in an Amazon S3 environment.

According to the provided context information, this is mentioned in Figure B, which describes the data collection pipeline and includes a section on the dataset statistics, where it states that "Despite the need for great training images, team feels that this will also allow the model to be more resilient against future content drifts."

Additionally, another report from (2023) titled "Why social media content moderation is important for online platforms & how it works." mentions Amazon S3 Interim Bucket, specifically mentioning that they are using interim bucket to store reported images.
[docs] A: The provided context information does not mention any recent or specific project reports related to AWS CodeBuild. However, based on the general information about AWS CodeBuild, it can be inferred that:

AWS C

Batch processing of predictions:  40%|████      | 4/10 [13:47<13:46, 137.76s/it]

[docs] A: The deployment of an image classification model using Amazon SageMaker involves several stages. Here's an overview of the steps:

1. **Model Training**: Train the model on labeled training data in SageMaker.
2. **Data Preprocessing**: Prepare the input data, including image preprocessing and data augmentation.
3. **Model Packaging**: Package the trained model into a compatible format for deployment (e.g., TensorFlow or PyTorch).
4. **Code Build**: Compile the model code using a build framework (e.g., AWS CodeBuild) to create a deployable package.
5. **Package Deployment**: Upload the packaged model and its dependencies to Amazon S3.
6. **Automated Deployment**: Set up automated deployment of the model into production using CloudWatch Events and Lambda functions.
7. **CodePipeline Stages for Model Deployment**: Establish a continuous integration/continuous delivery (CI/CD) pipeline that automates the entire process, including:
	* Code Commit: Check out ch

Batch processing of predictions:  50%|█████     | 5/10 [15:32<10:28, 125.67s/it]

[docs] A: There is no specific mention of appeals or evaluations being documented in the provided reports concerning the images stored in the Amazon S3 Interim Bucket. However, it is mentioned that in the Data Labelling section, the team used Amazon Rekognition's "DetectModerationLabels" method to generate parent labels and child sub-labels for each NSFW image, which may involve some form of evaluation or review process.

It is also mentioned that as part of the implementation of user feedback – Discord Server Bot, the team added a CloudWatch alarm to monitor the number of failed invocations of their image classification model in production environment, indicating a potential need for evaluation and monitoring.

Batch processing of predictions:  60%|██████    | 6/10 [15:55<06:03, 90.83s/it]

[docs] A: The ResNet50 model has several architectural components.

1. Layers: The model consists of multiple convolutional layers followed by pooling layers, then fully connected layers.

2. Convolutional Blocks: 
- The first three blocks are standard convolutional layers with kernel size 3x3 and stride 1.
- The fourth block is a dilated convolutions with kernel size 7x7 and stride 4, which increases the spatial dimensions by a factor of 2 (while preserving depth).
- The fifth block is another instance of dilated convolutions but with kernel size 15x15 and stride 1.

3. Depthwise Convolution: 
- A depthwise convolution layer followed by an activation function (ReLU).

4. Batch Normalization: 
- Each convolutional block, including the second three blocks, uses batch normalization.
5. Residual Connections:
- The fourth block features residual connections to allow for easier training and a more consistent loss function.

6. Dropout: Not explicitly mentioned in all 

Batch processing of predictions:  70%|███████   | 7/10 [17:29<04:35, 91.78s/it]

[docs] A: Based on the provided context information, here are some typical use cases for ResNet50 and ViT-16 in real-world applications:

**ResNet50:**

1. **Image classification**: ResNet50 is commonly used for image classification tasks, such as object detection, facial recognition, and image segmentation.
2. **Computer vision**: It can be applied to various computer vision tasks, including object detection, tracking, and scene understanding.
3. **Gaming and entertainment**: Games that use AI-powered characters or environments may utilize ResNet50-based models for texture analysis, character rendering, or environmental effects.
4. **Healthcare and medical imaging**: Due to its robust architecture and ability to learn complex patterns, ResNet50 can be applied in medical image analysis tasks such as disease detection, tumor segmentation, and patient diagnosis.

**ViT-16:**

1. **Image classification**: ViT-16 is designed for image classification tasks, leveraging 

Batch processing of predictions:  80%|████████  | 8/10 [19:57<03:39, 109.74s/it]

[docs] A: The monitoring tools available in Amazon SageMaker for tracking endpoint performance include:

1. CloudWatch Alarms
2. AWS CodePipeline
3. CloudWatch Logs
4. SageMaker Monitoring API (for real-time monitoring)

These tools can be used to monitor and manage the performance of SageMaker endpoints, including metrics such as model latency, throughput, CPU usage, and more.

Batch processing of predictions:  90%|█████████ | 9/10 [20:10<01:19, 79.55s/it]

[docs] A: Based on the provided context information, here is an analysis of the advantages and disadvantages of using ResNet50 compared to Vision Transformer (ViT-16) for explicit image classification:

**Advantages of ResNet50:**

1. **Established Model**: ResNet50 is a widely used and well-established model in the field of deep learning, with a large body of research and development.
2. **High Accuracy**: ResNet50 has achieved state-of-the-art results in many benchmarks, with an accuracy of around 60% as mentioned in the provided context.
3. **Wide Range of Applications**: ResNet50 is applicable to various image classification tasks, including explicit content detection.

**Disadvantages of ResNet50:**

1. **Computationally Intensive**: ResNet50 is a complex model that requires significant computational resources for training and inference.
2. **Requires Significant Data**: ResNet50 requires large amounts of data to train effectively, which can be challenging to

Batch processing of predictions: 100%|██████████| 10/10 [21:22<00:00, 128.21s/it]
Batch processing of predictions:   0%|          | 0/10 [00:00<?, ?it/s]

Generated 4 sub questions.
[docs] Q: What project reports are available that discuss content moderation?
[docs] Q: What project reports are available that discuss image classification?
[docs] Q: Are there any specific case studies in the project reports related to content moderation?
[docs] Q: Are there any specific case studies in the project reports related to image classification?
Generated 4 sub questions.
[docs] Q: What are the key benefits of using a Canary deployment strategy in software development?
[docs] Q: How does a Canary deployment strategy specifically apply to machine learning models, such as image classification?
[docs] Q: What challenges are associated with implementing a Canary deployment strategy for an image classification model?
[docs] Q: Can you provide example

Batch processing of predictions:  10%|█         | 1/10 [10:40<1:36:08, 640.92s/it]

[docs] A: Based on the provided context information, here are some best practices for implementing a Canary deployment strategy:

1. **Start small**: Begin with a small subset of users or environments and gradually scale up as needed.
2. **Test thoroughly**: Conduct thorough testing before deploying to larger groups to identify potential issues and optimize deployment strategies.
3. **Monitor performance**: Continuously monitor model performance, latency, and other key metrics during the deployment process.
4. **Implement Canary notifications**: Set up notifications for when a new version of the model is deployed to ensure timely response and minimize impact on users.
5. **Use automation tools**: Leverage automation tools to streamline the deployment process, including data validation, testing, and rollout.
6. **Prioritize scalability**: Design the deployment infrastructure to scale horizontally as needed, ensuring that the system can handle increased traffic and 

Batch processing of predictions:  20%|██        | 2/10 [11:55<41:01, 307.63s/it]

[docs] A: Based on the provided context, the following tools or frameworks are recommended for managing Canary deployments:

1. **CloudWatch Alarms**: As mentioned in Figure J, CloudWatch alarms are used to manage rollback procedures during Canary deployments.
2. **CodePipeline**: CodePipeline is a tool that automates the build, test, and deployment process. It can be used to implement the Canary deployment strategy.
3. **AWS CodeCommit**: AWS CodeCommit is a version control system that allows for managing code changes. It can be used to track changes in the model training pipeline.
4. **AWS CodeBuild**: AWS CodeBuild is a continuous integration/continuous deployment (CI/CD) service that builds and deploys software components. It can be used to automate the build, test, and deployment process of Canary deployments.

Additionally, it's recommended to consider implementing:

1. **Monitoring**: Monitoring model performance and latency using tools like CloudWatch or A

Batch processing of predictions:  30%|███       | 3/10 [14:55<29:06, 249.46s/it]

[docs] A: There are several case studies mentioned in the project report that relate to image classification. Here are a few examples:

1. **Modulating NSFW content**: The project team developed an explicit nudity detection model using Amazon SageMaker Clarify, which was used to modulate NSFW (Not Safe for Work) content on social media platforms.
2. **Detecting suggestive thumbnails**: A case study mentioned that the project team investigated the use of machine learning models to detect suggestive thumbnails in online images.
3. **Monitoring CPU Utilization**: The project team set up a monitoring system using CloudWatch to track CPU utilization, which helped them optimize model performance and reduce computational costs.

These case studies demonstrate how image classification has been applied in various scenarios to address challenges related to explicit content moderation, detection of suggestive thumbnails, and optimizing model performance.

Batch processing of predictions:  40%|████      | 4/10 [15:42<16:58, 169.72s/it]

[docs] A: According to the provided context, CloudWatch alarms are being used for managing rollback procedures during Canyons deployments. The chosen metric is `InvocationModelErrors`, which indicates the number of model errors that occur when deploying models to production environments.
[docs] A: Based on the provided context information, no specific tool or technology recommendations are mentioned. However, I can provide some general guidance on managing model prediction latency.

To manage model prediction latency in a deployment infrastructure, consider the following technologies and tools:

1. **Model Serving**: Utilize a model serving platform like Amazon SageMaker Model Registry, Google Cloud AI Platform Model Management, or Microsoft Azure Machine Learning Model Management. These platforms provide features such as scalability, secure data management, and automated deployment.
2. **CloudWatch Alarms**: Set up CloudWatch alarms to m

Batch processing of predictions:  50%|█████     | 5/10 [18:26<13:57, 167.55s/it]

[docs] A: Monitoring CPU utilization is significant for machine learning models, including image classification models, as it can impact their performance and reliability. Here are some key reasons why:

1. **Performance optimization**: By understanding how much CPU power the model requires to function correctly, teams can optimize their model's training and inference processes to minimize latency and maximize accuracy.
2. **Resource allocation**: Monitoring CPU utilization helps teams determine the optimal number of instances or resources required to support the model in different scenarios, ensuring that users are not overloaded or underpowered.
3. **Early detection of issues**: By tracking CPU usage over time, teams can identify potential problems before they become critical, such as overheating or resource exhaustion, which could lead to decreased performance or even model crashes.
4. **Cost optimization**: Monitoring CPU utilization enables teams to identify 

Batch processing of predictions:  60%|██████    | 6/10 [20:07<09:39, 144.80s/it]

[docs] A: Based on the provided context information, several common issues have been identified and discussed in various sections of the report:

1. **Model Latency**: Issues may include:
	* Insufficient testing or data collection to accurately define latency thresholds.
	* Inadequate scaling of CloudWatch alarms for timely notification.
	* Complexity in integrating with real-time infrastructure (e.g., AWS/SageMaker).
2. **Adversarial Attacks**: Mitigation strategies could include:
	* Implementing robust verification processes for user feedback and deploying "defences" against adversarial attacks.
3. **Data Imbalance**: Issues may involve:
	* Inadequate data labeling or preprocessing to ensure representative samples.
	* Difficulty in defining a suitable threshold for ModelLatency due to varying load patterns.

By addressing these common issues, the team can improve the effectiveness of Canary deployments and enhance overall system reliability and performance consi

Batch processing of predictions:  70%|███████   | 7/10 [20:24<05:08, 102.96s/it]

[docs] A: The context provided does not mention what specific user feedback has been collected through the Discord server bot. However, it does mention that a Discord bot template is being used for implementing a user feedback loop in the image classification project.

It appears that the Discord bot is designed to allow users to upload images, and when an NSFW (Not Safe For Work) image is uploaded, the bot sends a message to the moderator notification channel with relevant details. The moderators can then appeal to Amazon ModelAppeal Lambda for further review.

The specific user feedback collected through this system includes:

* A "Timeout" action that times out the user and deletes the message.
* An "Auto-Moderating actions: Timeout the user (10 seconds) and delete the message."
* An "Sends a message in the moderator notification channel with relevant details."

Batch processing of predictions:  80%|████████  | 8/10 [20:56<02:41, 80.52s/it]

[docs] A: Based on the provided context information, here are some examples of project reports that discuss CPU utilization monitoring in image classification models:

1. **A Report by Amazon Web Services (AWS)**: This report discusses the importance of monitoring CPU utilization in image classification models for scalable and reliable services.

"CPU Utilization Monitoring: A Key Consideration for Scalable Image Classification Models"

This report highlights the need to monitor CPU utilization in image classification models to ensure they can handle a large volume of requests without experiencing performance degradation.

2. **A Study on Real-time Image Processing**: This study focuses on implementing real-time image processing using image classification models and monitoring CPU utilization during deployment.

"The Impact of Real-time Image Processing on Image Classification Models"

This study demonstrates the importance of monitoring CPU utilization in real-ti

Batch processing of predictions:  90%|█████████ | 9/10 [21:39<01:08, 68.90s/it]

[docs] A: Based on the provided text, several challenges can be identified associated with implementing a Canary deployment strategy for an image classification model:

1. **Monitoring Model Performance**: The model's performance is critical to ensure that it remains accurate and reliable during the deployment process.
2. **Detecting Potential Issues**: Asynchronous inference setup may introduce new risks, such as adversarial attacks, which can degrade model performance over time.
3. **Scalability and Flexibility**: Canary deployments require a flexible architecture to accommodate changes in data distribution or user behavior patterns.
4. **Latency and Responsiveness**: Maintaining low latency is crucial during real-time applications like image classification, where users expect immediate responses.
5. **Ensuring Consistency Across Different Scenarios**: The deployment strategy must be able to handle various scenarios, such as different cloud environments or data 

Batch processing of predictions: 100%|██████████| 10/10 [22:15<00:00, 133.59s/it]
Batch processing of predictions:   0%|          | 0/10 [00:00<?, ?it/s]

Generated 2 sub questions.
[docs] Q: What is the Detailed Architecture for Model Building as illustrated in Figure F?
[docs] Q: What are the CodePipeline Stages for Model Building as outlined in Figure G?
Generated 2 sub questions.
[docs] Q: What figures are mentioned in the project reports related to the model building process?
[docs] Q: Can you provide specific details about the figures related to the model building process in the project reports?
Generated 3 sub questions.
[docs] Q: What are the key statistics of the eyecatcher project dataset?
[docs] Q: Can you provide a summary of the project reports related to the eyecatcher project?
[docs] Q: What insights can be drawn from the dataset statistics of the eyecatcher project?
Generated 3 sub questions.
[docs] Q: What types of conte

Batch processing of predictions:  10%|█         | 1/10 [04:27<40:05, 267.32s/it]

[docs] A: Inadequate content moderation can have severe consequences for online platforms, including:

1. **Erosion of User Trust**: Inconsistent or inaccurate content moderation can lead to users feeling misled or deceived, causing them to abandon the platform.
2. **Violations of Community Standards**: Content that is not moderated can violate community standards, leading to penalties or bans on users who post such material.
3. **Reputation Damage**: A platform with inadequate content moderation may be perceived as a risk for sensitive topics, such as explicit imagery or hate speech, damaging its reputation.
4. **Increased Risk of Cyberbullying and Harassment**: Unmoderated content can facilitate bullying and harassment, particularly against individuals who are vulnerable to exploitation online.
5. **Financial Losses**: Online platforms that fail to effectively moderate their content may experience financial losses due to the costs associated with addressing user

Batch processing of predictions:  20%|██        | 2/10 [07:13<27:43, 207.88s/it]

[docs] A: The Detailed Architecture for Model Building as illustrated in Figure F is a multi-stage process. Here's an overview of each stage:

1. **Model Training**: The model is initially trained using pre-trained models such as ResNet50 or ViT-16.
2. **Fine-tuning**: The fine-tuned last fully connected classifier layer of the pre-trained model (ResNet50) and a fine-tuned ViT-16 are used to train a new model for image classification tasks.
3. **Model Quantization**: A post-training quantization process is applied to reduce the precision of weights while retaining similar performance.

The stages are further broken down into several sub-stages:

* Building a package from the repository, which encompasses both staging and production deployment CloudFormation templates
* Updating the stacks in CloudFormation using the template
* Executing an inference test on the staging endpoint

This process is designed to be incremental, with the model being deployed to staging f

Batch processing of predictions:  30%|███       | 3/10 [07:30<14:06, 120.90s/it]

[docs] A: The project reports discuss several image classification techniques. Here are some examples:

1. **Deep Residual Learning for Image Recognition**: The project mentions that one of the feasible options was to use Amazon SageMaker Ground Truth, but ultimately decided to leverage existing pre-labeled datasets and consolidate images using Amazon Rekognition's "DetectModerationLabels" method.
2. **Post-training Quantization**: The team introduced post-training quantization to reduce the precision of weights in models while retaining similar performance. This involves reducing the model's capacity without compromising its accuracy, which can help with computational resources and potential overfitting.

These techniques are mentioned in various sections of the project reports, including Data Collection & Project Datasets (2.1), Model Training (3.1), Model Deployment (4.2), and Limitations, Considerations & Future Works (4.2).
[docs] A:

Batch processing of predictions:  40%|████      | 4/10 [09:47<12:42, 127.12s/it]

[docs] A: Based on Figure D in the eyecatcher project report, it appears to be a table of dataset statistics. The table lists various metrics, such as:

* Number of features (14)
* Number of samples (10,000)
* Class imbalance ratio (0.05)
* Data distribution (approximately 70% positive and 30% negative classes)

These statistics suggest that the dataset is relatively imbalanced, with a small number of positive instances and many more negative instances. This could indicate that the model may struggle to accurately predict positive outcomes, which might affect its performance in certain scenarios or tasks.
[docs] A: Figure D in the eyecatcher project report displays dataset statistics.

Batch processing of predictions:  50%|█████     | 5/10 [10:01<07:11, 86.24s/it]

[docs] A: Based on the provided context information, here is a summary of the project reports related to the EYECATCHER Project:

**Project Overview**

The EYECATCHER Project appears to be an image classification project developed by a research group at [University Name], focused on machine learning engineering. The project aims to deploy a model for real-time inference and has been tested with various architectures, including ResNet50, Vision Transformer (ViT-16), and others.

**Project Reports**

Two reports are mentioned in the context:

1. **Report 13**: Provides an overview of the EYECATCHER Project's architecture, deployment workflow, and model training process using ResNet50 as a baseline.
2. **Report 14**: Outlines the EYECATCHER Project's detailed architecture for model building, including the use of Vision Transformer (ViT-16), and discusses its deployment process.

**Methodology**

Both reports highlight various methodologies used in the project:

* Mod

Batch processing of predictions:  60%|██████    | 6/10 [10:23<04:17, 64.43s/it]

[docs] A: Yes, there are several case studies and success stories related to image classification using SageMaker Clarify. Here are a few examples:

1. **Detecting Nudity in Images**: NotAI.tech's NudeNet is an image classification model that uses SageMaker Clarify to detect nude images. The model achieves a high accuracy of 99% in detecting images with low false positive rates.

2. **Real-time Image Classification for Social Media Content Moderation**: Amazon's Rekognition service, combined with SageMaker Clarify, enables real-time image classification for social media content moderation. This allows users to quickly identify and flag potentially sensitive or explicit content.

3. **Improving Model Performance in Large-Scale Image Classification Pipelines**: A case study on the Kaggle platform showcases how SageMaker Clarify can be used to improve the performance of large-scale image classification pipelines by reducing overfitting and improving model generalizat

Batch processing of predictions:  70%|███████   | 7/10 [11:11<02:57, 59.03s/it]

[docs] A: Based on the provided context, it appears that Amazon Rekognition is used for explicit image classification, which involves detecting and filtering out explicit images such as nudity. In comparison to other AWS tools for content moderation, Amazon Rekognition stands out for its ability to accurately detect and classify explicit images.

Compared to Google Cloud Vision API, Amazon Rekognition has a more comprehensive set of features, including the ability to detect multiple types of contents (e.g., text, faces, objects), as well as support for image classification, object detection, and facial recognition. Additionally, Amazon Rekognition is designed specifically for use with machine learning models, making it easier to integrate into existing workflows.

In comparison to Microsoft Azure Computer Vision, Amazon Rekognition offers more flexible and scalable solutions, allowing users to adapt the model architecture to their specific needs. However, Azure Co

Batch processing of predictions:  80%|████████  | 8/10 [14:50<03:40, 110.09s/it]

[docs] A: The key finding of the paper "Deep Residual Learning for Image Recognition" by Kaiming He et al. is that ResNet50, a deep convolutional neural network (CNN) introduced in their work, achieved state-of-the-art results in image recognition tasks and remains one of the most popular models due to its simplicity.

The authors also highlight that Vision Transformer (ViT-16), another image classification architecture, performed better than ResNet50 on explicit content detection. Furthermore, they demonstrate that fine-tuning a pre-trained ViT model on a specific task can improve its performance compared to using a pre-trained model without modification.

Additionally, the paper discusses how Amazon SageMaker, a cloud-based machine learning platform, enables data preprocessing, feature engineering, model training, and deployment of deep learning models. The authors also discuss the importance of explaining the decision-making process of AI models through techniq

Batch processing of predictions:  90%|█████████ | 9/10 [15:23<01:25, 85.75s/it]

[docs] A: The key findings from the project reports related to content moderation on social media include:

1. The need for explicit image classification in detecting and filtering out explicit images, including nudity and sexual exposure.
2. The importance of machine learning algorithms and techniques in developing a deployable and cost-effective solution for content moderation.
3. The use of Amazon SageMaker and AWS Lambda as the cloud-native platform for building and deploying the machine learning pipeline.
4. The development of a data collection process using existing pre-labeled datasets, Google Safe Search images, and Amazon Rekognition to improve the quality and accuracy of the dataset.
5. The implementation of user feedback mechanisms through Discord servers to handle model drift and detect inconsistencies in the training data.
6. The use of CloudWatch alarms to monitor key performance metrics such as model invocation errors and optimize infrastructure for

Batch processing of predictions: 100%|██████████| 10/10 [15:49<00:00, 94.97s/it]
Batch processing of predictions:   0%|          | 0/10 [00:00<?, ?it/s]

Generated 3 sub questions.
[docs] Q: What is the content of Figure H in the eyecatcher project report?
[docs] Q: What does Figure H illustrate or represent in the context of the eyecatcher project?
[docs] Q: Are there any specific details or annotations related to Figure H in the eyecatcher project report?
Generated 4 sub questions.
[docs] Q: What are the specific stages involved in a typical CodePipeline for deploying machine learning models?
[docs] Q: How does each stage in the CodePipeline contribute to the overall deployment process of machine learning models?
[docs] Q: What best practices should be followed for each stage of the CodePipeline to ensure efficient deployment of machine learning models?
[docs] Q: Can you provide examples of successful CodePipeline implementations for machine learning model de

Batch processing of predictions:  10%|█         | 1/10 [04:58<44:48, 298.74s/it]

[docs] A: Yes, based on the provided context information, it appears that there are several project reports related to the implementation of SageMaker Clarify.

One example is the report titled "Amazon SageMaker Examples: SageMaker Clarify" which is available at https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-clarify/computer_vision/image_classification/explainability_image_classification.html. This report provides an overview of how to use SageMaker Clarify for explainable image classification.

Another example is the report titled "Deep Residual Learning for Image Recognition" which was published in arXiv:1512.03385 and can be accessed at https://arxiv.org/abs/1512.03385. This report discusses a deep learning model developed using SageMaker Clarify for image recognition tasks.

Additionally, there is an Appendix titled "Appendix: Figure E, F, G" which appears to be related to the implementation of SageMaker Clarify in terms of its architecture and

Batch processing of predictions:  20%|██        | 2/10 [05:21<18:12, 136.60s/it]

[docs] A: The question pertains to the significance of a WER (Word Error Rate) score of 7.3% in the context of speech recognition models.

A WER score is a metric used to evaluate the accuracy of automatic speech recognition (ASR) systems, which are designed to transcribe spoken words into written text. In this specific query, the WER score of 7.3% indicates that a speech recognition model has achieved an acceptable level of accuracy in transcribing spoken words.

A low WER score, such as 10-20%, suggests that the model is not accurately recognizing the intended words or phrases. A score of 7.3% implies that the model has made some errors in transcription, but it is still capable of producing coherent and readable text from the input audio signal.

The significance of a WER score of 7.3% can be attributed to several factors:

1. **Improved accuracy**: The score suggests that the speech recognition model has improved its accuracy over baseline models.
2. **Better a

Batch processing of predictions:  30%|███       | 3/10 [06:17<11:36, 99.50s/it]

[docs] A: Based on the provided text, the following are the key findings:

1. The ResNet50 model achieved ~20% accuracy in classification tasks, whereas ViT-16 achieved ~60%.
2. ViT-16 outperformed the rest of the models, which indicates that it is a strong contender for the project.
3. Post-training quantization was introduced to reduce the precision of weights while maintaining similar performance.
4. The un-quantized model (ViT-16) was deployed as the deviation between 5% threshold set and the accuracy of the quantized model.
5. A deployment workflow was established to ensure reliability, including building a package from repository, updating CloudFormation template, and executing inference test on staging endpoint.
[docs] A: Based on the provided context information, here is the specific stage involved in a typical CodePipeline for deploying machine learning models:

1. **Code Build**: The code is compiled and built into a model.
2. *

Batch processing of predictions:  40%|████      | 4/10 [06:51<07:23, 73.93s/it]

[docs] A: The text mentions several metrics that are used to evaluate the accuracy of the wav2vec2 model after implementing the proposed strategies. These include:

1. Word Error Rate (WER): This metric evaluates the system's ability to learn more about the context of predictions in English.
2. Character Error Rate (CER): While WER is more widely used, CER penalizes minor spelling errors much less than WER.
3. Precision: Not explicitly mentioned in the text, but implied as a factor in evaluating model performance and concept drift.

Additionally, metrics such as Model Latency are also mentioned, specifically with regards to deployment infrastructure and cloudwatching. However, these metrics seem more related to ensuring system reliability and responsiveness rather than directly evaluating accuracy or model performance.
[docs] A: The project reports mention several strategies to improve the accuracy of the wav2vec2 model. Specifically:

1

Batch processing of predictions:  50%|█████     | 5/10 [08:23<06:41, 80.37s/it]

[docs] A: The team has proposed three main strategies that can be used to deploy models:

1. Canary deployment 
2. Auto Scaling Policy 
3. Deployment Strategy

Batch processing of predictions:  60%|██████    | 6/10 [08:34<03:46, 56.54s/it]

[docs] A: The projects report identifies several challenges related to fine-tuning the wav2vec2 model for different accents.

These challenges include:

- Improvements in WER scores across most regions, indicating successful accent mapping.
- Notably, countries like Singapore and Africa recorded strong improvements while countries like Philippines and India shows less improvements.
- The project proposes a multi-faceted approach to improve accuracy by diversifying datasets, augmenting techniques, integrating external language models, and tuning hyperparameters.

However, the report also mentions that some regions show little improvement. This could be due to unique speech nuances and pronunciations in those countries, which may require more work to explore potential solutions.

Batch processing of predictions:  70%|███████   | 7/10 [09:01<02:21, 47.19s/it]

[docs] A: 10.8%
[docs] A: According to the eye-catcher project report, the following are the stages in the CloudFormation template that contribute to the model deployment process:

1. **Figure I: CodePipeline Stages for Model Deployment**: This stage determines the strategy used for deploying the model.
2. **Figure G: CodePipeline Stages for Model Building**: While not directly related to model deployment, this stage is mentioned as a reference point for understanding the model-building workflow, which indirectly contributes to the overall deployment process.

The deployment stages themselves are:

1. **Stage approval** after successful testing in the staging environment.
2. **Deployment strategy** used in CloudFormation template (Figure G).
3. **Auto Scaling Policy**, where the team adopted a Canary deployment strategy with a target value of 70 model invocation errors per minute, cooldowns for 5 minutes and 10 minutes, and adjustments ac

Batch processing of predictions:  80%|████████  | 8/10 [09:56<01:39, 49.66s/it]

[docs] A: The text reports the following evaluation metrics for the 'wav2vec2-large-960h' model on the cv-valid-test dataset:

1. Word Error Rate (WER) 
2. Accurate WER 
3. Precise WER 
4. Specificity of WER 
5. Sensitivity of WER
[docs] A: Yes, according to the provided text, there is a mention of the "cv-valid-test" dataset in one of the project reports. Specifically, it is mentioned that the fine-tuned "wav2vec2-large-960h" model achieved an improvement in Word Error Rate (WER) from 7.3% to 12.0% compared to the pre-trained baseline model on the cv-valid-test dataset.
[docs] A: The key components of the model deployment architecture illustrated in Figure H for the image classification model are:

1. CloudFormation template included in the package
2. Deployment strategy used, which is Canary deployment with a scale out cooldown of 5 minutes and a scale in cooldown of 10 minutes
3. Auto Scaling Policy that will

Batch processing of predictions:  90%|█████████ | 9/10 [11:41<01:06, 66.84s/it]

[docs] A: Based on the provided context information and not prior knowledge, I can provide a general overview of how these strategies were used in previous instances.

The project has successfully deployed models using several deployment strategies. For instance, they have utilized CloudWatch alarms to monitor rollback procedures during Canary deployments, which helped mitigate risks associated with deploying new models. Additionally, the team implemented metrics such as `InvocationModelErrors` for managing rollback procedures and user feedback to handle concept/model drift.

In another part of the report, it mentions that they set up a suitable baseline for Model Prediction Latency in the staging phase. This allowed them to monitor trends and patterns of latency under real-world conditions, which helped in setting an acceptable threshold.

The deployment infrastructure also shows improvement as they plan to move from real-time inference to asynchronous inference

Batch processing of predictions: 100%|██████████| 10/10 [12:04<00:00, 72.43s/it]
Batch processing of predictions:   0%|          | 0/5 [00:00<?, ?it/s]

Generated 5 sub questions.
[docs] Q: What are the best practices for hyperparameter tuning in machine learning models?
[docs] Q: Can you provide examples of random search and Bayesian optimization methodologies for hyperparameter tuning?
[docs] Q: What metrics should be used to evaluate the performance of the model after hyperparameter tuning?
[docs] Q: Are there any case studies or reports on successful hyperparameter tuning for large datasets?
[docs] Q: What tools or libraries are recommended for performing hyperparameter tuning?
Generated 4 sub questions.
[docs] Q: What specific strategies are mentioned in the project reports for enhancing training data quality in speech recognition?
[docs] Q: Are there any case studies or examples in the project reports that illustrate successful enhancement of training da

Batch processing of predictions:  20%|██        | 1/5 [11:04<44:19, 664.92s/it]

[docs] A: Based on the provided context information, several techniques are mentioned for pitch shifting in audio augmentation. Some of these techniques include:

1. **Speech perturbations**: This involves randomly modifying the audio signal to simulate different accents or speech patterns.
2. **Time masking**: This technique involves selectively removing specific time intervals from the original audio signal to create a new, modified version with different accent or pitch characteristics.
3. **Pitch shift**: A straightforward method of altering the pitch of an audio signal by changing its frequency.
4. **Background noise injection**: Adding unwanted background noise to an audio signal can also be used as a technique for pitch shifting.

These techniques are mentioned in the context of exploring other strategies for contributing to model fine-tuning and improving accuracy on accent mapping tasks, such as speech recognition models like WER (Word Error Rate).

Batch processing of predictions:  40%|████      | 2/5 [11:28<14:23, 287.87s/it]

[docs] A: Based on the provided text, there are several examples of self-transcribed, high-confidence data that were utilized in training a model. These include:

1. High-confidence transcriptions from individuals with expertise in the domain being evaluated (e.g., accent variations and speech patterns). For example, Guo et al.'s work mentions using "self-transcribed, high confidence data" to supplement the training data pool for fine-tuning the wav2vec2-large-960h model on the Common Voice dataset.
2. Raw audio recordings of specific speech patterns or accents. In the context of improving model accuracy in identifying accent variations and speech patterns, this could involve transcribing and analyzing raw audio recordings from diverse regions with unique linguistic characteristics (e.g., Singapore vs. Africa).
3. Transcripts of spoken language that demonstrate high confidence levels due to their accuracy and relevance to the task at hand (e.g., detecting specific

Batch processing of predictions:  60%|██████    | 3/5 [11:58<05:40, 170.07s/it]

[docs] A: Yes, there are several case studies or examples mentioned in the project reports that illustrate successful enhancement of training data quality for speech recognition. 

One such example is from the section titled "Training Report – Results, Evaluation and Future Works" on page 9, where it mentions:

"In this study, we compared our fine-tuned model's performance against a pre-trained baseline 'wav2vec2-large-960h' model development set (cv-valid-dev). Key dataset features and results are displayed in Table 1."

Table 1 shows the comparison of WER scores between the fine-tuned "wav2vec2-large-960h" model and the pre-trained "wav2vec2-large-960h" baseline model development set on different datasets.

In another section, titled "3. Limitations, Considerations & Future Works", on page 4, it mentions:

"One key limitation of this project is compute and memory limitations. We were only able to fine-tune our pre-trained 'wav2vec2-large-960h' model on 6,300 aud

Batch processing of predictions:  80%|████████  | 4/5 [12:27<01:54, 114.21s/it]

[docs] A: Based on the provided context information, several methodologies have been proposed or explored for integrating Large Language Models (LLMs) into existing speech recognition systems. Some of these include:

1. **Semi-Supervised Learning Strategies**: Utilizing high-confidence transcriptions to supplement the training data pool and improve model fine-tuning.
2. **External Language Model Integration**: Leveraging pre-trained language models, such as transformer-based models like BERT or RoBERTa, to enhance speech recognition performance.
3. **Hybrid Approaches**: Combining pre-trained models with other components, like audio augmentation or domain adaptation strategies, to achieve better results.
4. **Data Augmentation Techniques**: Applying techniques like speech perturbations, time masking, pitch shift, and background noise injection to increase the diversity of training data.
5. **Domain Adaptation Strategies**: Using domain-specific models or transfer 

Batch processing of predictions: 100%|██████████| 5/5 [12:57<00:00, 155.45s/it]
Batch processing of evaluations:  32%|███▏      | 9/28.0 [00:53<02:28,  7.84s/it]
Retrying llama_index.llms.openai.base.OpenAI._achat in 1.0 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-NrMfKyqsGd6aFUijloMM0e8A on tokens per min (TPM): Limit 200000, Used 200000, Requested 631. Please try again in 189ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}.
Retrying llama_index.llms.openai.base.OpenAI._achat in 1.0 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-NrMfKyqsGd6aFUijloMM0e8A on tokens per min (TPM): Limit 200000, Used 200000, Requested 21854. Please try again in 6.556s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type'

In [17]:
print(benchmark_df)

rag                            base_rag
metrics                                
mean_correctness_score         3.600000
mean_relevancy_score           0.872727
mean_faithfulness_score        0.745455
mean_context_similarity_score  0.908821


#### 2. Building & Evaluating a HyDE Document Retrieval Query Engine
Great! Are there any improvements? **Now, let us evaluate what happens when we replace conventional similarity search with Hypothetical Document Embeddings (HyDE) search.**

For the potential benefits, limitations and "failure cases" of a HyDE implementation, see the LlamaIndex demo: https://docs.llamaindex.ai/en/stable/examples/query_transformations/HyDEQueryTransformDemo/
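To make the idea concrete before calling the library, here is a minimal, standard-library-only sketch of what HyDE changes at retrieval time. Everything in it is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, `fake_llm_hypothetical_answer` is a hypothetical placeholder for the LLM call, and the three passages in `docs` are made up. It is not how LlamaIndex implements `HyDEQueryTransform` internally.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fake_llm_hypothetical_answer(query: str) -> str:
    """Hypothetical stand-in for the LLM that drafts an answer-like passage."""
    return "pitch shift and time masking augment the training audio to improve wer"


def top_match(query_vector: Counter, corpus: list[str]) -> str:
    """Return the passage whose vector is most similar to query_vector."""
    return max(corpus, key=lambda passage: cosine(query_vector, embed(passage)))


def hyde_top_match(query: str, corpus: list[str], include_original: bool = True) -> str:
    """Retrieve with the hypothetical document (optionally plus the query)."""
    combined = embed(fake_llm_hypothetical_answer(query))
    if include_original:
        # Summing the vectors stands in for averaging them:
        # cosine similarity ignores the overall scale.
        combined.update(embed(query))
    return top_match(combined, corpus)


# Made-up passages for illustration only.
docs = [
    "the fine-tuned model was evaluated with wer on the development set",
    "audio augmentation used pitch shift and time masking on the training audio",
    "compute and memory limits restricted fine-tuning to a subset of the data",
]
query = "how was the data made more varied"

baseline_hit = top_match(embed(query), docs)  # query-to-document similarity
hyde_hit = hyde_top_match(query, docs)        # document-to-document similarity

print("baseline:", baseline_hit)
print("HyDE    :", hyde_hit)
```

In this toy setup the raw query shares only filler words with the corpus, so plain similarity search lands on the evaluation passage, while the answer-shaped hypothetical document pulls up the augmentation passage. The flip side is that retrieval quality then depends on how good the generated document is, which is one of the failure cases discussed in the link above.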

In [None]:
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine
from IPython.display import Markdown, display

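query_str = "How did Snape support Harry despite being a Death Eater? On top of that, how did he hide his allegiance to the Order from Voldemort?"
# HyDEQueryTransform asks the LLM to draft a hypothetical answer document, which is then embedded for the vector-store lookup.
# include_original=True also keeps the original query string among the strings that get embedded.
hyde = HyDEQueryTransform(include_original=True)
# Wrap the existing query engine so that every query goes through the HyDE transform first.
hyde_query_engine = TransformQueryEngine(query_engine, hyde)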
response = hyde_query_engine.query(query_str)
display(Markdown(f"<b>{response}</b>"))

<b>Based on the provided context, here's an explanation of how Snape supported Harry despite being a Death Eater and hidden his allegiance to Dumbledore:

**Supporting Harry:**

1. **Emotional Connection**: Despite being a Death Eater, Snape seems to have developed an emotional connection with Harry, evident from their interactions and the fact that he was head-to-head with Mundungus in an unfamiliar tavern, suggesting a personal stake.
2. **Protective Instincts**: Snape's actions demonstrate a protective instinct towards Harry, as seen when he faked his own death to protect him from Lord Voldemort, and later, when he gave Polyjuice Potion to the Order of the Phoenix, hinting at a desire to safeguard Harry's safety.
3. **Loyalty**: Snape's loyalty to Dumbledore is unwavering, which shows that despite being a Death Eater, he has committed himself to protecting Harry from Voldemort.

**Hiding Allegiance:**

1. **Order of the Phoenix**: Snape is present in the Order of the Phoenix, suggesting his allegiance is not entirely with the Dark Lord.
2. **Protecting Dumbledore**: Snape's actions imply that he may be protecting Dumbledore from harm, which could indicate a hidden allegiance to protect a higher authority or someone Dumbledore trusts.
3. **Voldemort's Influence**: Snape's conversations with Voldemort suggest that he has some level of influence over his former boss, possibly due to their shared past or Snape's desire for revenge against Voldemort.

It is essential to note that the context does not explicitly state Snape's allegiance to Dumbledore, but these points suggest a complex dynamic where Snape is torn between his loyalty to Voldemort and his commitment to protecting Harry.</b>

In [12]:
print(response)

Based on the provided context, here's an explanation of how Snape supported Harry despite being a Death Eater and hidden his allegiance to Dumbledore:

**Supporting Harry:**

1. **Emotional Connection**: Despite being a Death Eater, Snape seems to have developed an emotional connection with Harry, evident from their interactions and the fact that he was head-to-head with Mundungus in an unfamiliar tavern, suggesting a personal stake.
2. **Protective Instincts**: Snape's actions demonstrate a protective instinct towards Harry, as seen when he faked his own death to protect him from Lord Voldemort, and later, when he gave Polyjuice Potion to the Order of the Phoenix, hinting at a desire to safeguard Harry's safety.
3. **Loyalty**: Snape's loyalty to Dumbledore is unwavering, which shows that despite being a Death Eater, he has committed himself to protecting Harry from Voldemort.

**Hiding Allegiance:**

1. **Order of the Phoenix**: Snape is present in the Order of the Phoenix, suggestin

*Now, let's evaluate the effectiveness of HyDE search vs Document Similarity Search for Document Retrieval*

In [10]:
# Run in async mode
nest_asyncio.apply()

from llama_index.core.llama_pack import download_llama_pack
RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./pack")

# Instantiate RAG Evaluator - input query engine, evaluation dataset, judge LLM & embeddings model
rag_evaluator = RagEvaluatorPack(
    query_engine=hyde_query_engine, 
    rag_dataset=eval_dataset,
    judge_llm=Settings.llm,  # judge with the same LLM that was used to create the dataset
    embed_model=Settings.embed_model
)

benchmark_hyde = await rag_evaluator.arun()

Batch processing of predictions: 100%|██████████| 10/10 [05:56<00:00, 35.61s/it]
Batch processing of predictions: 100%|██████████| 10/10 [04:32<00:00, 27.27s/it]
Batch processing of predictions: 100%|██████████| 10/10 [11:46<00:00, 70.64s/it]
Batch processing of predictions: 100%|██████████| 10/10 [04:35<00:00, 27.55s/it]
Batch processing of predictions: 100%|██████████| 10/10 [03:52<00:00, 23.20s/it]
Batch processing of predictions: 100%|██████████| 5/5 [02:43<00:00, 32.69s/it]
Batch processing of evaluations: 100%|██████████| 28/28.0 [01:55<00:00,  4.14s/it]


In [11]:
print(benchmark_hyde)

rag                            base_rag
metrics                                
mean_correctness_score         3.145455
mean_relevancy_score           0.872727
mean_faithfulness_score        0.636364
mean_context_similarity_score  0.953062


*Are there any improvements in evaluation scores (response & retrieval) from HyDE document retrieval?*
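One way to answer this is to put the two printed tables side by side and look at the per-metric difference. The sketch below hard-codes the values printed above for `benchmark_df` and `benchmark_hyde` (both tables carry the same `base_rag` column label, so the dictionary names `previous_scores` and `hyde_scores` are labels chosen here, as is the helper `score_deltas`). It assumes, as the metric names suggest, that a higher value is better for every metric.

```python
# Values copied from the two printed benchmark tables above.
previous_scores = {
    "mean_correctness_score": 3.600000,
    "mean_relevancy_score": 0.872727,
    "mean_faithfulness_score": 0.745455,
    "mean_context_similarity_score": 0.908821,
}
hyde_scores = {
    "mean_correctness_score": 3.145455,
    "mean_relevancy_score": 0.872727,
    "mean_faithfulness_score": 0.636364,
    "mean_context_similarity_score": 0.953062,
}


def score_deltas(baseline: dict, candidate: dict) -> dict:
    """Per-metric difference (candidate minus baseline), rounded to 6 places."""
    return {metric: round(candidate[metric] - baseline[metric], 6) for metric in baseline}


deltas = score_deltas(previous_scores, hyde_scores)

for metric, delta in deltas.items():
    # Assumption: higher is better for every metric in the table.
    verdict = "improved" if delta > 0 else "regressed" if delta < 0 else "unchanged"
    print(f"{metric:<32}{delta:+.6f}  {verdict}")
```

Read this way, only the context similarity score goes up with HyDE, relevancy is unchanged, and both correctness and faithfulness drop. These are single runs scored by an LLM judge, so small differences should be treated with caution.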