Create 5 examples of word arithmetic similar to the "king - man + woman ≈ queen" analogy. Use words that have relevant semantic relationships. Steps:

1. Load the BERT model and tokenizer.

2. Implement functions to get word embeddings and perform word arithmetic.

3. Write word_arithmetic and find_most_similar functions to create your examples. Each example uses two lists of words:
○ The first list holds the arguments to word_arithmetic, for example (paris, france, italy). Run the arithmetic and collect the return value (e.g., paris - france + italy = ?).
○ Pass the return value of word_arithmetic to find_most_similar, along with the second list of candidate words such as (rome, romaine, ramania, ronnie, random), to find the word most similar to the answer.
○ Show this for 5 potential pairs of such word lists.
○ Print the answer for each of the 5 test cases.

In [1]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# 1. Load BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# 2. Get word embeddings
def get_word_embedding(word):
    inputs = tokenizer(word, return_tensors='pt')
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the final hidden state of the [CLS] token (position 0) as the word vector
    return outputs.last_hidden_state[0][0].numpy()

# 3. Word arithmetic function
def word_arithmetic(word1, word2, word3):
    embedding1 = get_word_embedding(word1)
    embedding2 = get_word_embedding(word2)
    embedding3 = get_word_embedding(word3)

    # Word arithmetic: word1 - word2 + word3
    result_vector = embedding1 - embedding2 + embedding3
    return result_vector

# 4. Find the most similar word from a list of candidates
def find_most_similar(vector, candidates):
    similarities = []
    for word in candidates:
        candidate_embedding = get_word_embedding(word)
        similarity = cosine_similarity([vector], [candidate_embedding])[0][0]
        similarities.append((word, similarity))
    # Pick the candidate with the highest similarity
    return max(similarities, key=lambda x: x[1])[0]

# 5. Test with example pairs
def test_word_arithmetic():
    test_cases = [
        (['teacher', 'school', 'classroom'], ['student', 'principal', 'playground', 'library']),
        (['doctor', 'hospital', 'clinic'], ['patient', 'nurse', 'treatment', 'lady']),
        (['author', 'publisher', 'bookstore'], ['nurse', 'book', 'reader', 'prince']),
        (['chief', 'restaurant', 'kitchen'], ['food', 'recipe', 'salt', 'pendrive']),
        (['artist', 'gallery', 'museum '], ['paint', 'exhibit', 'sun', 'twilight'])
    ]

    for test_input, candidates in test_cases:
        result_vector = word_arithmetic(*test_input)
        most_similar_word = find_most_similar(result_vector, candidates)
        print(f"Word Arithmetic: {test_input[0]} - {test_input[1]} + {test_input[2]} ≈ {most_similar_word}")

# Run the test cases
test_word_arithmetic()

Word Arithmetic: teacher - school + classroom ≈ principal
Word Arithmetic: doctor - hospital + clinic ≈ nurse
Word Arithmetic: author - publisher + bookstore ≈ book
Word Arithmetic: chief - restaurant + kitchen ≈ food
Word Arithmetic: artist - gallery + museum  ≈ exhibit
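
The mechanics of word_arithmetic and find_most_similar can be checked without downloading BERT. The following is a minimal sketch that assumes hand-made 3-dimensional vectors: the values in `toy_vectors` are invented for illustration and are not BERT outputs, and `toy_arithmetic`, `toy_most_similar`, and `cosine` are hypothetical helper names that mirror the functions in the cell above.

```python
import numpy as np

# Hypothetical toy embeddings: axis 0 ~ "royalty", axis 1 ~ "male", axis 2 ~ "female".
# These values are made up for illustration; they are not BERT outputs.
toy_vectors = {
    "king":   np.array([1.0, 1.0, 0.0]),
    "queen":  np.array([1.0, 0.0, 1.0]),
    "man":    np.array([0.0, 1.0, 0.0]),
    "woman":  np.array([0.0, 0.0, 1.0]),
    "prince": np.array([0.8, 1.0, 0.0]),
    "apple":  np.array([0.1, 0.1, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def toy_arithmetic(word1, word2, word3):
    """Return vec(word1) - vec(word2) + vec(word3)."""
    return toy_vectors[word1] - toy_vectors[word2] + toy_vectors[word3]

def toy_most_similar(vector, candidates):
    """Return the candidate whose toy vector has the highest cosine similarity."""
    return max(candidates, key=lambda w: cosine(vector, toy_vectors[w]))

result = toy_arithmetic("king", "man", "woman")
print(toy_most_similar(result, ["queen", "prince", "apple", "man"]))  # prints "queen"
```

In the notebook cell, get_word_embedding takes the place of the dictionary lookup; the subtraction, addition, and cosine ranking are the same operations.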


Implement a simple RAG system using LangChain, process an article of your choice, and run 5 different queries on its content. Steps:

1. Choose at least 5 diverse articles, each on a different topic of interest to you, from the Wikipedia dump on HuggingFace (e.g., Artificial Intelligence, Machine Learning, etc.).

2. Use the code provided in class to load and process each article, create embeddings, store the embeddings for every article in a single VectorDB, and set up the RAG system.

3. Formulate 10 diverse queries that explore various aspects of your articles' content. Run each query using the run_query function and record the results.
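
The steps above boil down to: split the articles into chunks, embed every chunk, retrieve the chunks closest to a query, and hand them to the language model as context. The sketch below illustrates that flow with NumPy only. It is an assumption-laden stand-in, not the class code: the bag-of-words `embed` function replaces a real sentence-embedding model, the two-sentence corpus is invented, and `chunk_text`, `build_vocab`, `retrieve`, and `build_prompt` are hypothetical helper names.

```python
import re
from collections import Counter

import numpy as np

def chunk_text(text, chunk_size=8):
    """Split text into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocab(chunks):
    """Map every token seen in the chunks to a vector index."""
    vocab = sorted({tok for chunk in chunks for tok in tokenize(chunk)})
    return {tok: i for i, tok in enumerate(vocab)}

def embed(text, vocab):
    """Toy bag-of-words embedding; a stand-in for a sentence-embedding model."""
    vec = np.zeros(len(vocab))
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = count
    return vec

def retrieve(query, chunks, vocab, k=1):
    """Return the k chunks with the highest cosine similarity to the query."""
    q = embed(query, vocab)
    scores = []
    for chunk in chunks:
        c = embed(chunk, vocab)
        denom = np.linalg.norm(q) * np.linalg.norm(c)
        scores.append(float(q @ c / denom) if denom else 0.0)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def build_prompt(query, context_chunks):
    """Assemble the text that would be sent to the language model."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical two-article corpus, made up for illustration
articles = [
    "Supervised learning trains a model on labeled examples.",
    "Photosynthesis converts sunlight into chemical energy in plants.",
]
chunks = [c for article in articles for c in chunk_text(article)]
vocab = build_vocab(chunks)

query = "How does supervised learning use labeled examples?"
top_chunks = retrieve(query, chunks, vocab, k=1)
print(build_prompt(query, top_chunks))
```

A vector database such as FAISS plays the role of `retrieve` here, and the generated prompt is what the LLM completes to produce the answer.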

In [13]:
!pip install langchain langchain-community faiss-cpu transformers sentence-transformers
import langchain



In [14]:
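import os
# The import paths below assume LangChain 0.1-0.3 with the langchain-community
# package installed; they move between releases, so adjust them to your version.
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline

# Initialize HuggingFace Embeddings and LLM
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
llm_model_name = "distilgpt2"

embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)
llm = HuggingFacePipeline.from_model_id(
    model_id=llm_model_name,
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 100},
)

# Load articles (assuming they are text files in the directory "articles")
def load_articles(directory):
    articles = []
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                articles.append(file.read())
    return articles
articles = load_articles("articles")

# Split the articles into chunks and embed them all into a single FAISS vector store
# (chunking keeps the retrieved context within distilgpt2's short input window)
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = [chunk for article in articles for chunk in splitter.split_text(article)]
vector_store = FAISS.from_texts(chunks, embeddings)

# Define the RAG system. SimpleRAG is a small local helper, not a LangChain class:
# it retrieves the most relevant chunks and passes them to the LLM as context.
class SimpleRAG:
    def __init__(self, vector_store, llm):
        self.chain = RetrievalQA.from_chain_type(
            llm=llm,
            chain_type="stuff",
            retriever=vector_store.as_retriever(search_kwargs={"k": 2}),
        )

    def run_query(self, query):
        return self.chain.invoke({"query": query})["result"]

rag = SimpleRAG(vector_store=vector_store, llm=llm)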

# Formulate queries
queries = [
    "What is Machine Learning and how does it work?",
    "What are the different types of Machine Learning algorithms?",
    "How does supervised learning differ from unsupervised learning?",
    "What is a neural network and how is it used in Machine Learning?",
    "What are some common applications of Machine Learning in healthcare?",
    "What challenges do practitioners face when deploying Machine Learning models in production?",
    "How do hyperparameters affect the performance of Machine Learning models?",
    "What are some recent advancements in Machine Learning research?",
    "How does Machine Learning contribute to autonomous vehicles?",
    "What ethical considerations are associated with the use of Machine Learning?"
]

# Run queries and record results
def run_queries(rag, queries):
    results = {}
    for query in queries:
        result = rag.run_query(query)
        results[query] = result
    return results

results = run_queries(rag, queries)

# Print results
for query, result in results.items():
    print(f"Query: {query}")
    print(f"Answer: {result}\n")

In [None]:
queries1 = [
    # Commas are required: adjacent string literals would otherwise be joined into one query
    "What is deep learning and how does it differ from traditional Machine Learning?",
    "How do convolutional neural networks (CNNs) work and what are they used for?",
    "What are recurrent neural networks (RNNs) and how are they applied in sequence data?",
    "What is transfer learning and how can it be applied to improve model performance?",
    "How do generative adversarial networks (GANs) work and what are their applications?",
]

# Run queries and record results
def run_queries(rag, queries1):
    results = {}
    for query in queries1:
        result = rag.run_query(query)
        results[query] = result
    return results

results = run_queries(rag, queries1)

# Print results
for query, result in results.items():
    print(f"Query: {query}")
    print(f"Answer: {result}\n")

In [None]:
queries2 = [
    # Commas are required: adjacent string literals would otherwise be joined into one query
    "How is Machine Learning used in medical diagnosis and prognosis?",
    "What are the benefits and challenges of using Machine Learning for personalized medicine?",
    "How do Machine Learning models contribute to drug discovery and development?",
    "What are the ethical considerations of using Machine Learning in healthcare?",
    "How does Machine Learning enhance medical imaging and analysis?",
]

# Run queries and record results
def run_queries(rag, queries2):
    results = {}
    for query in queries2:
        result = rag.run_query(query)
        results[query] = result
    return results

results = run_queries(rag, queries2)

# Print results
for query, result in results.items():
    print(f"Query: {query}")
    print(f"Answer: {result}\n")

In [10]:
queries3 = [
    # Commas are required: adjacent string literals would otherwise be joined into one query
    "What are the key ethical concerns in Machine Learning and artificial intelligence?",
    "How can bias in Machine Learning models be detected and mitigated?",
    "What is explainable AI (XAI) and why is it important?",
    "How does privacy impact Machine Learning practices and what are some solutions?",
    "What are the implications of AI decision-making on societal fairness?",
]

# Run queries and record results
def run_queries(rag, queries3):
    results = {}
    for query in queries3:
        result = rag.run_query(query)
        results[query] = result
    return results

results = run_queries(rag, queries3)

# Print results
for query, result in results.items():
    print(f"Query: {query}")
    print(f"Answer: {result}\n")

NameError: name 'rag' is not defined