---
title: "Part 2: Learning About Embedding Similarity"
description: "Implementing Semantic Search and K-NN for Wikipedia article similarity"
author: "Kei Taketsuna"
date: "3/7/2025"
categories:
  - LLMs
  - prompting
  - logic
---

# 🏎️ **Implementing KNN (k-Nearest Neighbors)**
KNN is a **simple** but **powerful** algorithm that:
1. Stores **training data**
2. Finds the **K closest** points to a test sample
3. Assigns the **most common** label among those neighbors

---

## 🏗️ **Building a Simple KNN Class**
This class:
- Stores training data (`X_train`, `y_train`)
- Uses **Euclidean distance** to find nearest neighbors
- Returns the **most common** label among them

🚀 Part 1: KNN Classifier Playground
Goal: Build your own KNN classifier and see decision boundaries evolve!

🧩 Step 1: Implement Simple KNN
Help Professor AI finish the KNN class!


## KNN Algorithm

### Initialize SimpleKNN Class
1. Set the number of neighbors (k) with default value 3

### Train the model (fit function)
1. Store the training data points (X_train)
2. Store the corresponding labels (y_train)

### Make predictions (predict function)
1. Initialize empty list for predictions
2. For each test point:
   a. Calculate Euclidean distance to all training points
      - Subtract test point from all training points
      - Square the differences
      - Sum along axis 1
      - Take square root of sum
   b. Find indices of k nearest neighbors
      - Use argpartition function to get k smallest distances
   c. Get labels of k nearest neighbors
   d. Perform majority voting
      - Use Counter to count label occurrences
      - Find most common label
   e. Append most common label to predictions

3. Return predictions as a numpy array
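
The steps above can be sketched in Python roughly as follows. This is a minimal version, not necessarily the code shown in the screenshot: it assumes labels are hashable values, clamps `k` to the number of training points, and has `fit` return `self` purely so calls can be chained. The names `X_demo`, `y_demo`, and `knn_demo` are made up for the toy example.

```python
from collections import Counter

import numpy as np


class SimpleKNN:
    """Minimal k-nearest-neighbors classifier using Euclidean distance."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # KNN is a lazy learner: "training" only stores the data.
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)
        return self  # assumption: returned only to allow chaining

    def predict(self, X):
        predictions = []
        for x in np.asarray(X, dtype=float):
            # Euclidean distance from x to every training point.
            distances = np.sqrt(np.sum((self.X_train - x) ** 2, axis=1))
            # Indices of the k smallest distances (not sorted among themselves).
            k = min(self.k, len(distances))
            nearest = np.argpartition(distances, k - 1)[:k]
            # Majority vote over the neighbors' labels.
            votes = Counter(self.y_train[nearest].tolist())
            predictions.append(votes.most_common(1)[0][0])
        return np.array(predictions)


# Toy data: two well-separated groups of three points each.
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

knn_demo = SimpleKNN(k=3).fit(X_demo, y_demo)
demo_predictions = knn_demo.predict(np.array([[0.1, 0.1], [5.0, 5.1]]))
print(demo_predictions)
```

`argpartition` is used instead of a full sort because only the identity of the k closest points matters, not their order.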

## Helper Functions
1. numpy.sqrt(): Calculate square root
2. numpy.sum(): Sum array elements
3. numpy.argpartition(): Partially sort array to find k smallest elements
4. Counter(): Count occurrences of elements in a list
5. most_common(): Get most common element(s) from Counter object


<img src="a.png" width=900/>

## 🏦 MNIST Data & KNN Boundaries
We load the **digits** dataset, scale, and use **PCA** to reduce the features to 2 dimensions.
- This helps us **plot** decision boundaries for each digit (0-9).
- We then **draw** lines/regions indicating which digit the KNN would classify a point as.
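
A rough sketch of that pipeline is below. It assumes the "digits" dataset is scikit-learn's `load_digits` (8×8 images, a small cousin of MNIST) and swaps in scikit-learn's `KNeighborsClassifier` so the block runs on its own. The choice of 5 neighbors and a 100×100 grid is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load, scale, and project the 64 pixel features down to 2 dimensions.
digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data)
X_2d = PCA(n_components=2, random_state=0).fit_transform(X_scaled)

# Stand-in classifier (assumption): any object with fit/predict would do.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_2d, digits.target)

# Classify every point of a grid covering the 2D projection.
xx, yy = np.meshgrid(
    np.linspace(X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1, 100),
    np.linspace(X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1, 100),
)
grid_labels = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
print(X_2d.shape, grid_labels.shape)
```

`grid_labels` holds the predicted digit for each grid cell, which is what a filled contour plot would color to show the decision regions.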


📊 Visualizing KNN on MNIST Digits
Goal: Explore how KNN performs on PCA-reduced MNIST digits!

(Example implementation based on your KNN implementation)

<img src="b.png" width=900/>
<img src="c.png" width=900/>
<img src="d.png" width=900/>
<img src="e.png" width=900/>
<img src="f.png" width=900/>
<img src="g.png" width=900/>

## 🧩 Custom Synthetic Data
Using Gaussian blobs, we generate random clusters to mimic real data.
- Ideal for visualizing how KNN forms boundaries.
- Ideal for testing overfitting (small K) vs. underfitting (large K).

## Data Generation and Preprocessing
1. Define function create_realistic_data():
   a. Set cluster properties (position, number of samples, standard deviation)
   b. For each cluster:
      - Generate x and y values using normal distribution
      - Combine x and y values into 2D points
      - Assign cluster ID to each point
   c. Combine all cluster data
   d. Return features (X) and labels (y)

2. Call create_realistic_data() to generate X and y
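
One way the generation steps could look is sketched below. The cluster centers, sample counts, standard deviations, and the `seed` parameter are illustrative assumptions, not the values used in the notebook.

```python
import numpy as np


def create_realistic_data(seed=42):
    """Return 2D points drawn from a few Gaussian clusters, plus cluster IDs."""
    rng = np.random.default_rng(seed)
    # (center_x, center_y, n_samples, std): made-up values for illustration.
    clusters = [
        (2.0, 2.0, 100, 0.8),
        (6.0, 6.0, 100, 1.0),
        (2.0, 7.0, 80, 0.6),
    ]
    X_parts, y_parts = [], []
    for cluster_id, (cx, cy, n, std) in enumerate(clusters):
        # Draw x and y coordinates independently from normal distributions.
        x_vals = rng.normal(cx, std, n)
        y_vals = rng.normal(cy, std, n)
        X_parts.append(np.column_stack([x_vals, y_vals]))
        y_parts.append(np.full(n, cluster_id))
    # Combine all clusters into one feature matrix and one label vector.
    return np.vstack(X_parts), np.concatenate(y_parts)


X, y = create_realistic_data()
print(X.shape, y.shape)
```

Giving the clusters different spreads and sizes makes the boundaries less symmetric, which tends to make the effect of K easier to see.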

## Visualization Functions
1. Define function plot_knn_boundary(knn, X, y, title):
   a. Create a figure
   b. Generate a mesh grid for the plot area
   c. Predict classes for each point in the mesh grid
   d. Create a custom colormap
   e. Plot decision regions using contourf
   f. Plot original data points with scatter
   g. Add title, labels, and colorbar
   h. Display the plot

2. Define function plot_accuracy_curve(X, y, max_k):
   a. Split data into training and testing sets
   b. For k from 1 to max_k:
      - Create and train KNN model
      - Make predictions on test set
      - Calculate and store accuracy
   c. Plot k values vs accuracies
   d. Add title, labels, and grid
   e. Display the plot
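
The two plotting helpers could be sketched as below. Several things here are assumptions: `StandInKNN` is a hypothetical compact classifier included only so the block runs on its own, the train/test split is a simple NumPy shuffle with a 30% test share, the pastel colors are arbitrary, and the functions return the figure instead of calling `plt.show()` so they also work in a script.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend for scripts; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap


class StandInKNN:
    """Hypothetical compact KNN; labels must be non-negative integers."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X_train, self.y_train = np.asarray(X, dtype=float), np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Pairwise squared distances, shape (n_test, n_train).
        d2 = ((X[:, None, :] - self.X_train[None, :, :]) ** 2).sum(axis=2)
        nearest = np.argsort(d2, axis=1)[:, : self.k]
        return np.array([np.bincount(row).argmax() for row in self.y_train[nearest]])


def plot_knn_boundary(knn, X, y, title, steps=100):
    fig, ax = plt.subplots(figsize=(8, 6))
    # Mesh grid covering the data with a margin of 1 on each side.
    xx, yy = np.meshgrid(
        np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, steps),
        np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, steps),
    )
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    cmap = ListedColormap(["#FFAAAA", "#AAFFAA", "#AAAAFF"])  # custom colormap
    ax.contourf(xx, yy, Z, alpha=0.4, cmap=cmap)
    points = ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap, edgecolor="k")
    ax.set(title=title, xlabel="Feature 1", ylabel="Feature 2")
    fig.colorbar(points, ax=ax, label="Class")
    return fig, Z


def plot_accuracy_curve(X, y, max_k, test_fraction=0.3, seed=0):
    # Shuffle indices and hold out a fraction of the data for testing.
    order = np.random.default_rng(seed).permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]
    k_values = list(range(1, max_k + 1))
    accuracies = []
    for k in k_values:
        model = StandInKNN(k=k).fit(X[train_idx], y[train_idx])
        accuracies.append(float(np.mean(model.predict(X[test_idx]) == y[test_idx])))
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.plot(k_values, accuracies, marker="o")
    ax.set(title="KNN accuracy vs. k", xlabel="k", ylabel="Test accuracy")
    ax.grid(True)
    return fig, accuracies


# Toy data: two Gaussian blobs of 60 points each.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.5, (60, 2)), rng.normal(4, 0.5, (60, 2))])
y_demo = np.repeat([0, 1], 60)

boundary_fig, Z_demo = plot_knn_boundary(
    StandInKNN(k=5).fit(X_demo, y_demo), X_demo, y_demo, "KNN boundary (k = 5)"
)
curve_fig, accs = plot_accuracy_curve(X_demo, y_demo, max_k=10)
```

In a notebook, the returned figures render automatically at the end of a cell; in a script, `boundary_fig.savefig(...)` would write them to disk.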


<img src="h.png" width=900/>
<img src="i.png" width=900/>
<img src="j.png" width=900/>
<img src="k.png" width=900/>
<img src="l.png" width=900/>
<img src="n.png" width=900/>
<img src="o.png" width=900/>
<img src="p.png" width=900/>
<img src="q.png" width=900/>
<img src="r.png" width=900/>

## 🌐 Wikipedia + OpenAI Setup
- We create a `wiki` client to talk to Wikipedia.
- We retrieve the **page content** of each article.
- Then we **embed** that text with `get_embedding`.


🌌 Wikipedia Semantic Explorer
Mission: Become an AI librarian finding related articles!

🔑 API Setup
Unlock the knowledge vaults

Let's set up the OpenAI API and Wikipedia client.

<img src="ss.png" width=900/>

## 📜 `get_article_text` Function 📜
1. Connects to **Wikipedia** using `wiki.page`.
2. **Checks** if the page exists.
3. **Returns** the page’s text for embedding or analysis.

📚 Fetch Wikipedia Articles
Implement a function to fetch article content.

## Setup and Initialization
1. Initialize Wikipedia API with user agent
2. Define list of article titles to fetch

## Article Fetching and Embedding

### Function: get_article_text(title)
1. Fetch Wikipedia page for given title
2. If page exists, return page text
3. Otherwise, return None

### Function: get_embedding(text, model)
1. Truncate text to 8000 characters
2. Count tokens in truncated text
3. Update total token count for cost tracking
4. Generate embedding using OpenAI API
5. Return embedding vector

### Main Embedding Process
1. Initialize empty lists for embeddings and labels
2. For each article title:
   a. Fetch article text
   b. If text is retrieved:
      - Generate embedding
      - Add embedding to embeddings list
      - Add title to labels list
      - Print success message
   c. If fetch fails, print failure message
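
The main loop can be sketched without touching the network by passing the fetch and embed steps in as functions. In the notebook those would be `get_article_text` and `get_embedding`. Here `embed_articles`, `fake_fetch`, `fake_embed`, and `FAKE_PAGES` are hypothetical stand-ins, and the "embedding" is just two crude text statistics, not a real vector from the OpenAI API.

```python
def embed_articles(titles, fetch_text, embed, max_chars=8000):
    """Fetch and embed each title; skip titles whose fetch returns nothing."""
    embeddings, labels = [], []
    for title in titles:
        text = fetch_text(title)
        if text:
            # Truncate before embedding, mirroring the 8000-character limit.
            embeddings.append(embed(text[:max_chars]))
            labels.append(title)
            print(f"Embedded: {title}")
        else:
            print(f"Could not fetch: {title}")
    return embeddings, labels


# Hypothetical offline stand-ins for Wikipedia and the embedding API.
FAKE_PAGES = {
    "Machine learning": "Machine learning is the study of algorithms that improve with data.",
    "Quantum computing": "Quantum computing uses qubits to perform computation.",
}


def fake_fetch(title):
    return FAKE_PAGES.get(title)  # None when the page does not exist


def fake_embed(text):
    # NOT a real embedding: character count and word count only.
    return [float(len(text)), float(len(text.split()))]


titles = ["Machine learning", "Quantum computing", "No such article"]
embeddings, labels = embed_articles(titles, fake_fetch, fake_embed)
```

Keeping `embeddings` and `labels` aligned index by index is what later lets a similarity score be mapped back to an article title.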

## KNN Implementation

### KNN Class Initialization
1. Store embeddings as numpy array
2. Normalize embeddings
3. Store labels

### KNN Query Method
1. Generate embedding for input text
2. Normalize query embedding
3. Calculate similarities using dot product
4. Find indices of top k similar articles
5. Return list of (label, similarity) pairs for top matches

## KNN Usage
1. Initialize KNN with generated embeddings and labels
2. (Ready for querying with new text inputs)
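
A minimal sketch of such a class follows. It differs from the outline in one respect, which is an assumption made for testability: the embedding function is passed in as `embed_fn` rather than calling `get_embedding` directly. `toy_embed` and `VOCAB` are hypothetical and only count keywords; they stand in for a real embedding model.

```python
import numpy as np


class KNN:
    """Nearest-neighbor lookup over embeddings using cosine similarity."""

    def __init__(self, embeddings, labels, embed_fn):
        # Copy, then scale every row to unit length.
        self.embeddings = np.array(embeddings, dtype=float)
        self.embeddings /= np.linalg.norm(self.embeddings, axis=1, keepdims=True)
        self.labels = list(labels)
        self.embed_fn = embed_fn  # assumption: injected instead of get_embedding

    def query(self, text, k=3):
        q = np.array(self.embed_fn(text), dtype=float)
        q /= np.linalg.norm(q)
        # On unit vectors the dot product equals cosine similarity.
        sims = self.embeddings @ q
        top = np.argsort(sims)[::-1][:k]
        return [(self.labels[i], float(sims[i])) for i in top]


# Hypothetical keyword-count "embedding" so the example runs offline.
VOCAB = ("neural", "quantum", "history")


def toy_embed(text):
    words = text.lower().split()
    # The small offset avoids a zero vector when no keyword appears.
    return [words.count(w) + 0.01 for w in VOCAB]


labels = ["Neural network", "Quantum computing", "History of science"]
docs = ["neural neural network", "quantum quantum computing", "history of science history"]

knn = KNN([toy_embed(d) for d in docs], labels, toy_embed)
results = knn.query("neural networks in machine learning", k=2)
for title, score in results:
    print(f"{title}: {score:.3f}")
```

Normalizing once in the constructor means each query costs a single matrix-vector product.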

<img src="SA.png" width=900/>
<img src="SB.png" width=900/>

## 🔗 `KNN` Class for Wikipedia Articles 🔗
1. **Holds** embeddings and article labels/titles.
2. On **query**, it:
   - Embeds your query text.
   - Computes **cosine similarity** with all stored article embeddings.
   - Sorts by similarity and returns the top matches.

📚 Build the KNN Class

Implement a KNN class to find similar articles.

<img src="SC.png" width=900/>

## Example 1: Finding Similar Articles

1. Define query text: "Neural networks in machine learning"
2. Use KNN to find top 3 similar articles:
   a. Call knn.query() with query text and k=3
   b. Store results

3. Print "Top Matches:"
4. For each result (title and score):
   a. Print title and formatted score

## Example 2: Comparing Two Specific Articles

1. Define two article titles:
   - article1 = "Artificial Intelligence"
   - article2 = "Quantum Computing"

2. For each article:
   a. Fetch article text using get_article_text()
   b. Generate embedding using get_embedding()

3. Calculate similarity:
   a. Normalize both embeddings
   b. Compute dot product of normalized embeddings

4. Print similarity score between the two articles
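
The similarity step in Example 2 boils down to a few lines. The sketch below uses short hand-written vectors in place of real article embeddings, so `emb1`, `emb2`, and `emb3` are purely illustrative.

```python
import numpy as np


def cosine_similarity(a, b):
    """Normalize both vectors, then take their dot product."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_unit = a / np.linalg.norm(a)
    b_unit = b / np.linalg.norm(b)
    return float(a_unit @ b_unit)


# Illustrative vectors, not real embeddings.
emb1 = [1.0, 2.0, 2.0]
emb2 = [2.0, 4.0, 4.0]   # same direction as emb1, twice the length
emb3 = [2.0, -1.0, 0.0]  # orthogonal to emb1

same_direction = cosine_similarity(emb1, emb2)
orthogonal = cosine_similarity(emb1, emb3)
print(f"{same_direction:.2f} {orthogonal:.2f}")
```

Because of the normalization, vector length has no effect: only the angle between the two embeddings determines the score.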

<img src="SE.png" width=900/>

📚 Plot Similarity Heatmap

Visualize the similarity matrix.

## Create Similarity Matrix

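1. Normalize each embedding to unit length, then calculate the dot product of the embeddings with their transpose (on unit vectors this equals cosine similarity):
   similarity_matrix = dot_product(embeddings, transpose(embeddings))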

## Visualize Similarity Matrix as Heatmap

1. Create a new figure with size 10x8

2. Generate heatmap:
   a. Use imshow() to display similarity matrix
   b. Set colormap to "viridis"

3. Set x-axis labels:
   a. Use article titles as labels
   b. Set tick positions to range from 0 to number of articles - 1
   c. Rotate labels 90 degrees

4. Set y-axis labels:
   a. Use article titles as labels
   b. Set tick positions to range from 0 to number of articles - 1

5. Add colorbar:
   a. Label it "Cosine Similarity"

6. Set title of the plot to "Wikipedia Article Similarities"

7. Display the plot
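
A sketch of the heatmap steps is below. It normalizes the embeddings inside the function so the matrix holds cosine similarities regardless of input scale, returns the figure instead of calling `plt.show()`, and uses random vectors with placeholder titles (`demo_labels`) in place of real article embeddings. All three are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend for scripts; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_similarity_heatmap(embeddings, labels):
    # Unit-normalize rows so the dot products are cosine similarities.
    E = np.array(embeddings, dtype=float)
    E /= np.linalg.norm(E, axis=1, keepdims=True)
    similarity_matrix = E @ E.T

    fig, ax = plt.subplots(figsize=(10, 8))
    image = ax.imshow(similarity_matrix, cmap="viridis")

    # One tick per article on both axes.
    ticks = np.arange(len(labels))
    ax.set_xticks(ticks)
    ax.set_xticklabels(labels, rotation=90)
    ax.set_yticks(ticks)
    ax.set_yticklabels(labels)

    fig.colorbar(image, ax=ax, label="Cosine Similarity")
    ax.set_title("Wikipedia Article Similarities")
    fig.tight_layout()
    return fig, similarity_matrix


# Placeholder data: random vectors standing in for article embeddings.
demo_labels = ["Article A", "Article B", "Article C", "Article D"]
demo_embeddings = np.random.default_rng(0).normal(size=(4, 16))
heatmap_fig, sim = plot_similarity_heatmap(demo_embeddings, demo_labels)
```

The diagonal of the matrix is always 1, since every article is identical to itself, so the interesting structure is in the off-diagonal cells.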

<img src="SF.png" width=900/>
<img src="SD.png" width=900/>