# Introduction to Word Embeddings Lab

Welcome to the Word Embeddings Lab, an introduction to machine learning and natural language processing (NLP). In this session, we aim to demystify how machines represent and process human language. You'll learn about embeddings, a cornerstone of NLP, and use them to build a semantic search engine.

## What Will You Learn?

- **Word Embeddings**: Understand what word embeddings are and why they are a powerful tool for representing text in a way that captures the meaning and relationships between words.
- **Semantic Search**: Build a semantic search engine that can find relevant articles based on the meaning of a search query, rather than just keyword matching.
- **FAISS**: Get introduced to FAISS (Facebook AI Similarity Search), a library for efficient similarity search over embedding vectors.
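
To make the idea of "capturing relationships" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embeddings. The three-dimensional vectors below are invented for illustration only; real embeddings come from a trained model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", not produced by any real model.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])
print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

In this toy setup, "cat" and "dog" point in nearly the same direction and score close to 1, while "cat" and "car" score much lower.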

## Why Are These Concepts Important?

- **Machine Understanding**: Word embeddings let computers compare pieces of text by meaning and context rather than by exact wording.
- **Applicability**: The concepts you learn here are used in a variety of applications, from recommendation systems to automated customer support.
- **Real-world Tools**: You will work with real-world tools that professionals use for machine learning projects, including pre-trained models from the Hugging Face Hub, loaded through the `sentence-transformers` library.

## Before You Start:

Remember that the field of AI is about experimentation and innovation. Don't be afraid to try new things and ask questions. The goal is to learn and explore, even if things don't work perfectly the first time. Now, let's embark on this journey through the world of AI and NLP together!


### Setup: environment

This cell sets up a conda environment in Google Colab, which allows us to install and manage packages we'll need for our project.

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install() # expect a kernel restart

### Setup: package installs

Here we install several important packages:
- `sentence-transformers`: For working with state-of-the-art sentence embeddings.
- `faiss`: For efficient similarity searches.
- `wikipedia`: To easily access and download Wikipedia articles.
- `pandas`: For organizing and manipulating data.

In [None]:
!mamba install sentence-transformers faiss wikipedia pandas -yq

### Setup: package imports

After installing the necessary packages, we import them into our notebook. This gives us the tools we need to start working on our machine learning project.

In [None]:
import wikipedia
import sentence_transformers
import faiss
import numpy
import transformers

## Preparing the Embedding Model

This section sets the Transformers logging level to errors only, which keeps the output clean. We then load a pre-trained model, `intfloat/e5-small-v2`, for generating embeddings.

In [None]:
# Set Transformers' logging to error only to suppress download messages
transformers.logging.set_verbosity_error()

# Prepare an embedding model
model = sentence_transformers.SentenceTransformer("intfloat/e5-small-v2")

## Fetching and Indexing Articles

Here we define two functions:
- `get_articles_by_topic`: To fetch and preprocess Wikipedia articles.
- `create_index`: To create a FAISS index with the articles' embeddings.

In [None]:
def get_articles_by_topic(topics):
    # Step 1: Fetch articles
    articles = {topic: wikipedia.page(topic).content for topic in topics}

    # Step 2: Preprocess text
    # (assuming simple preprocessing for demonstration)
    processed_articles = {
        title: content.replace("\n", " ") for title, content in articles.items()
    }
    return processed_articles

# Prepare a function to create a new index
def create_index(passages, model, instruction="passage"):
    if instruction:
        passages = [
            f"{instruction}: {passage}" for passage in passages
        ]
    # Step 3: Generate embeddings
    embeddings = [
        model.encode(content, normalize_embeddings=True)
        for content in passages
    ]

    # Step 4: Indexing with FAISS
    # Get the size of the embeddings
    dimension = embeddings[0].shape[0]
    # Use inner product as the similarity measure; because the embeddings
    # are normalized, this is equivalent to cosine similarity
    index = faiss.IndexFlatIP(dimension)

    # Stack the list of embeddings into a 2-D float32 matrix for FAISS
    embeddings_matrix = numpy.array(embeddings, dtype="float32")
    index.add(embeddings_matrix)  # Add embeddings to the index

    # return the results
    return index
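
`faiss.IndexFlatIP` performs an exact, exhaustive search: it scores the query against every stored vector by inner product and keeps the best `k`. The sketch below illustrates that idea in plain Python with invented two-dimensional vectors. It is not how FAISS is implemented, and the helper names `normalize` and `flat_inner_product_search` are made up for this example.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def flat_inner_product_search(stored, query, k):
    """Score the query against every stored vector and return the top k.

    Returns (scores, indices), ordered from highest score to lowest.
    """
    scores = [sum(s * q for s, q in zip(vec, query)) for vec in stored]
    ranked = sorted(range(len(stored)), key=lambda i: scores[i], reverse=True)[:k]
    return [scores[i] for i in ranked], ranked

# Toy vectors, invented for illustration.
stored = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
query = normalize([2.0, 1.0])

top_scores, top_indices = flat_inner_product_search(stored, query, k=2)
print(top_indices, [round(s, 3) for s in top_scores])
```

Because every vector is normalized first, the scores are cosine similarities, and a higher score means a closer match.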

## Building the Index

We'll fetch some articles by topic and create an index for them.

In [None]:
topics = [
    "Earth",
    "Computer Science",
    "Artificial Intelligence",
    "Python (programming language)",
    "Leonardo da Vinci",
    "Eiffel Tower",
]
articles = get_articles_by_topic(topics)
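# Pass the article texts; iterating over the dict itself would embed only the titles
index = create_index(list(articles.values()), model)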

## Semantic Search Function

In this part, we implement the semantic search function. It allows us to search our indexed articles with a natural language query.

In [None]:
# Step 5: Semantic search
def search(query, model, k=3, instruction="query"):
    """
    Search for relevant articles given a query.
    Some models need a special instruction (e.g. "query: ")
    """
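    # Embed the query, adding the model's instruction prefix if needed
    if instruction:
        query = f"{instruction}: {query}"
    # Normalize so the inner product matches the normalized passage embeddings
    query_embedding = model.encode(query, normalize_embeddings=True)
    # k=3 finds the 3 closest articles; with an inner-product index the
    # returned "distances" are similarity scores (higher is better)
    distances, indices = index.search(numpy.array([query_embedding]), k=k)
    return distances, indices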

# Step 6: Present results
query = "I want to learn about historical landmarks in Europe"
distances, indices = search(query, model)

titles = list(articles.keys())
for score, idx in zip(distances[0], indices[0]):
    title = titles[idx]
    # IndexFlatIP returns inner-product scores, so higher means more similar
    print(f"Article title: {title}")
    print(f"Similarity score: {score}")
    print(f"Snippet: {articles[title][:100]}...")  # Display the first 100 characters
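
The indices returned by the search are row positions in the index, so they map back to titles only if the titles are kept in the same order the embeddings were added. When `k` is larger than the number of indexed vectors, FAISS fills the missing slots with `-1`. The sketch below shows one way to handle both points; `format_results` is a hypothetical helper, and the sample scores are made up.

```python
def format_results(titles, scores, indices):
    """Pair each returned index with its title and score, skipping -1 placeholders."""
    results = []
    for score, idx in zip(scores, indices):
        if idx < 0:  # placeholder for a slot with no matching vector
            continue
        results.append((titles[idx], float(score)))
    return results

# Hypothetical search output for k=3 against an index holding only two articles.
titles = ["Eiffel Tower", "Leonardo da Vinci"]
rows = format_results(titles, scores=[0.83, 0.79, -3.4e38], indices=[0, 1, -1])
for title, score in rows:
    print(f"{title}: {score:.2f}")
```

With the lab's own variables, a call along the lines of `format_results(list(articles.keys()), distances[0], indices[0])` should work the same way.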

## Exploring Semantic Search with New Topics

Now that you have learned how to create a semantic search engine using word embeddings and FAISS, it's time to put your new skills into practice!

### Your Challenge:

1. **Select New Topics**: Choose 10-15 new topics that interest you. These can be anything from your favorite sport or a historical figure you admire to a science concept you're curious about.

2. **Fetch and Index**: Use the `get_articles_by_topic` function to fetch the Wikipedia articles for your chosen topics and then create a new index using the `create_index` function.

3. **Craft Your Search Queries**: Think about what you want to learn from these articles. Formulate 3-5 search queries that reflect your interests or questions.

4. **Search and Discover**: Use the `search` function with your queries to see which articles are most relevant to your questions. Examine the results and see if the articles answer your questions or if they lead to new questions.

5. **Reflect and Share**: After you perform your searches, take some time to reflect on the results.
   - Were the articles what you expected?
   - Did you find the information you were looking for?
   - Share your findings and insights with the class.

This is your opportunity to explore the vast knowledge contained in Wikipedia using the power of AI. Have fun searching!

### Tips for Success:

- Be specific with your search queries to get the best results.
- If your first search doesn't return what you expected, try rephrasing your query or choosing different keywords.
- Remember that the way you phrase your query can greatly influence the search results.
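
One way to see how much phrasing matters is to compare the titles returned for two versions of the same question. The sketch below measures their overlap; `result_overlap` is a hypothetical helper, and the title lists are invented rather than real search output.

```python
def result_overlap(results_a, results_b):
    """Jaccard overlap between two lists of result titles (1.0 means identical sets)."""
    set_a, set_b = set(results_a), set(results_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical top-3 titles returned for two phrasings of the same question.
first_phrasing = ["Eiffel Tower", "Leonardo da Vinci", "Earth"]
second_phrasing = ["Eiffel Tower", "Earth", "Computer science"]
overlap = result_overlap(first_phrasing, second_phrasing)
print(f"Overlap: {overlap:.2f}")
```

A low overlap suggests the two phrasings pull the search toward different articles, which is worth noting in your reflection.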

Happy Exploring!
