# 🧠 NLP Foundations Workshop: Vector Space Proximity

### 🔹 Introduction to Vector Space Proximity

A large majority of the data on the Internet is **unstructured**, for example: social media posts, emails, images, videos and audio files.

If we want to **persist** all these media in a database, we may add **metadata** about them, such as file type or creation timestamp, or we could **tag** each file, or parts of it, so that it is easy to search for. Metadata and tags are needed because it would be very difficult to identify media items from their low-level (byte) representations.

But, what if we want to make the process fully automated (i.e., remove the need to manually add features, like tags, to each media item)? We need another way to represent the semantics of digital media.

That is the reason why in **Information Retrieval (IR)** and **Natural Language Processing (NLP)**, we often represent documents and queries as **vectors** in a **high-dimensional space**, where:

* Each **dimension** corresponds to a **unique term** in the vocabulary.
* A **document** is represented by a **point** or a **vector** in the space.
* A **vector** is a list of weights (e.g., term frequencies, TF-IDF values) that describe the presence or importance of terms in a document or query.
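
To make the three points above concrete, here is a minimal sketch that turns a text into a term-frequency vector. The two-term vocabulary, the helper name `term_frequency_vector`, and the sample sentence are assumptions chosen for illustration, and tokenization is a plain whitespace split.

```python
from collections import Counter

# Hypothetical two-term vocabulary: each term is one dimension of the space.
vocabulary = ["rich", "poor"]

def term_frequency_vector(text, vocab):
    """Return raw term counts for `text`, one entry per vocabulary term."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[term] for term in vocab]

doc_vector = term_frequency_vector("Rich poor gap grows as the rich get richer", vocabulary)
print(doc_vector)  # [2, 1]
```

Words outside the vocabulary ("gap", "grows", "richer") simply do not contribute to any dimension.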

---

#### 📘 Example 1: "Rich" and "Poor" Axes

![Vector Space Example: "Rich" and "Poor" Axes](./images/Fig1_CartesianVectorSpace.png)

Suppose our vocabulary only has two terms:

* `"rich"`
* `"poor"`

These two terms define a **2D Cartesian space**:

* The **x-axis** corresponds to the term **"rich"**.
* The **y-axis** corresponds to the term **"poor"**.

Each document is represented as a vector in this space:

* A document with many occurrences of “poor” and none of “rich” lies on the **y-axis**.
* A document that mentions both “rich” and “poor” lies in the **first quadrant**.
* A document with only “rich” is aligned along the **x-axis**.

The **query vector** for the query "rich poor", $\vec{q} = (1, 1)$, points in the direction of interest for the search engine.

### 🔹 Euclidean Distance and Its Limitations

One might assume we can measure similarity using **Euclidean distance**:

$$
\text{Euclidean}( \vec{q}, \vec{d} ) = \sqrt{ \sum_{i=1}^{n} (q_i - d_i)^2 }
$$

However, this has problems in practice:

* If document $d_2$ contains more occurrences of both “rich” and “poor” than the query, its vector will have a **longer length**.
* As seen in the diagram, even though $d_2$ has strong content overlap with the query $q$, it may still be **further away** in Euclidean terms than unrelated documents like $d_3$.
* This happens because **magnitude dominates**, not direction.
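
The effect can be checked with a few lines of Python. This is only a sketch: the `(rich, poor)` counts below are hypothetical values picked to illustrate the problem, not numbers taken from the figure.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical (rich, poor) term counts.
q = [1, 1]    # query: each term once
d2 = [4, 4]   # on-topic document with the same balance of terms, but longer
d3 = [1, 0]   # short document that mentions only "rich"

dist_d2 = euclidean(q, d2)
dist_d3 = euclidean(q, d3)
print(round(dist_d2, 3), round(dist_d3, 3))  # 4.243 1.0
```

Although `d2` points in exactly the same direction as `q`, it ends up farther away than `d3` simply because it is longer.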

### 🔹 Angle as Similarity → Cosine Similarity

To solve this, we focus on **vector direction**, not length. We compare the **angle** between the document and query vectors using **Cosine Similarity**, which is the cosine of that angle:

$$
\cos(\vec{q}, \vec{d}) = \frac{ \vec{q} \cdot \vec{d} }{ \|\vec{q}\| \cdot \|\vec{d}\| }
= \frac{ \sum_{i=1}^{n} q_i \cdot d_i }{ \sqrt{ \sum_{i=1}^{n} q_i^2 } \cdot \sqrt{ \sum_{i=1}^{n} d_i^2 } }
$$

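* For non-negative term weights (such as counts or TF-IDF values), this gives us a similarity score from **0 (orthogonal)** to **1 (identical direction)**. For vectors with negative components, such as learned embeddings, the score ranges from **-1 (opposite direction)** to **1**.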
* Longer documents that are semantically aligned still get **high similarity**.
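
The formula translates directly into code. The sketch below uses only the standard library; the vectors are hypothetical `(rich, poor)` counts, and returning `0.0` for a zero-length vector is a convention assumed here, not something the formula defines.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v, following the formula above."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: a zero vector matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical (rich, poor) term counts.
q = [1, 1]
d2 = [4, 4]   # same direction as q, larger magnitude
d3 = [1, 0]   # only "rich"

sim_d2 = cosine_similarity(q, d2)
sim_d3 = cosine_similarity(q, d3)
print(round(sim_d2, 4), round(sim_d3, 4))  # 1.0 0.7071
```

The longer document `d2` now scores highest because only its direction matters.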

### 🔹 Why Cosine Similarity Works Better

* **Angle** captures **semantic alignment**.
* It is **not affected** by document length or repetition.
* Example: duplicating document $d$ to make $d'$ will increase Euclidean distance — but **cosine similarity remains 1**.
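
The duplication example can be verified with a short sketch. The helper `count_vector` and the sample text are assumptions for illustration; `d_prime` is the document concatenated with itself, so every term count doubles.

```python
import math
from collections import Counter

def count_vector(text, vocab):
    counts = Counter(text.split())
    return [counts[term] for term in vocab]

text = "rich poor gap grows"
vocab = sorted(set(text.split()))

d = count_vector(text, vocab)                     # [1, 1, 1, 1]
d_prime = count_vector(text + " " + text, vocab)  # [2, 2, 2, 2]

euclid = math.sqrt(sum((a - b) ** 2 for a, b in zip(d, d_prime)))
dot = sum(a * b for a, b in zip(d, d_prime))
cos = dot / (math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in d_prime)))

print(euclid, cos)  # 2.0 1.0
```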

Cosine similarity is at the core of:

* **Search ranking**
* **Embedding-based retrieval**
* **LLM scoring and attention mechanisms** (attention relies on the closely related scaled dot product)

Sample code:

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Define the documents and the query
documents = [
    "Ranks of starving poets swell",       # d1
    "Rich poor gap grows",                 # d2
    "Record baseball salaries in 2010"     # d3
]

query = ["rich poor"]                     # q

# Create a CountVectorizer to convert text to term frequency vectors
vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(documents + query).toarray()

# Separate vectors
doc_matrix = doc_vectors[:3]  # d1, d2, d3
query_vector = doc_vectors[3].reshape(1, -1)  # q

# Compute cosine similarity
cosine_similarities = cosine_similarity(query_vector, doc_matrix).flatten()

# Create a DataFrame to show results
df = pd.DataFrame({
    'Document': ['Doc1', 'Doc2', 'Doc3'],
    'Cosine Similarity with Query': cosine_similarities
})

# Sort for clarity
print("Query: ", query)
df.sort_values(by='Cosine Similarity with Query', ascending=False, inplace=True)
df.reset_index(drop=True, inplace=True)

# Display the result
df


Query:  ['rich poor']


|   | Document | Cosine Similarity with Query |
|---|----------|------------------------------|
| 0 | Doc2     | 0.707107                     |
| 1 | Doc1     | 0.0                          |
| 2 | Doc3     | 0.0                          |


### 📘 Example 2: Word Vectors in a Small Corpus

Let's start with a small corpus of just six words, each represented by a vector in 3D space:

```plaintext
CAT     → [ 0.2, -0.4,  0.7]
DOG     → [ 0.6,  0.1,  0.5]
APPLE   → [ 0.8, -0.2, -0.3]
ORANGE  → [ 0.7, -0.1, -0.6]
HAPPY   → [-0.5,  0.9,  0.2]
SAD     → [ 0.4, -0.7, -0.5]
```

Each term is represented by a **vector in 3D space**.

### 🔍 Observations

- Words with **similar meanings** tend to have **similar vector representations**.
  - For example, **APPLE** and **ORANGE** are close in vector space, reflecting their semantic similarity.

- Words with **opposite meanings** tend to have **vectors pointing in opposite directions**.
  - For instance, **HAPPY** and **SAD** have contrasting vectors, indicating their opposing emotional tones.
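
Both observations can be checked numerically with cosine similarity. The sketch below copies four of the vectors from the six-word example; the helper name `cosine` is an assumption.

```python
import math

# Vectors copied from the six-word example.
vectors = {
    "APPLE":  [0.8, -0.2, -0.3],
    "ORANGE": [0.7, -0.1, -0.6],
    "HAPPY":  [-0.5, 0.9, 0.2],
    "SAD":    [0.4, -0.7, -0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

sim_fruit = cosine(vectors["APPLE"], vectors["ORANGE"])
sim_mood = cosine(vectors["HAPPY"], vectors["SAD"])
print(round(sim_fruit, 3))  # 0.934
print(round(sim_mood, 3))   # -0.935
```

A value near 1 means the two vectors point almost the same way, while a value near -1 means they point in nearly opposite directions.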

![3D Visualization of Word Vectors](./images/Fig2_3DVisualizationWordVectors.png)



Vector representations are also called **Embeddings**.

There are several approaches that **word embedding methods** use to generate effective vector representations.

One of them is **frequency-based embeddings**, word representations that are derived from the frequency of words in a corpus. They are based on the idea that the **importance** or the **significance** of a word can be inferred from **how frequently it occurs in the text**. One such embedding is called **Term Frequency - Inverse Document Frequency** or **TF-IDF**. 

TF-IDF highlights words that are frequent within a specific document but are rare across the entire corpus. For example, in a document about music, it would emphasize words such as **rap**, **disco**, **pop**, **rock**. On the other hand, pronouns would receive a low TF-IDF score.
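
As a rough illustration, the sketch below computes TF-IDF by hand using the textbook form `tf * log(N / df)`. The three mini-documents and the function name `tf_idf` are hypothetical, and libraries such as scikit-learn apply smoothed variants of this formula, so their numbers will differ.

```python
import math

# Hypothetical mini-corpus: one token list per document.
docs = {
    "music":   "the rap song has the best beat".split(),
    "sports":  "the team won the game".split(),
    "finance": "the market fell".split(),
}

def tf_idf(term, doc_name):
    """Textbook TF-IDF: relative term frequency times log(N / document frequency)."""
    tokens = docs[doc_name]
    tf = tokens.count(term) / len(tokens)
    df = sum(1 for other in docs.values() if term in other)
    if df == 0:
        return 0.0  # term never appears in the corpus
    return tf * math.log(len(docs) / df)

score_rap = tf_idf("rap", "music")
score_the = tf_idf("the", "music")
print(round(score_rap, 3), score_the)  # 0.157 0.0
```

The word "the" is the most frequent token in the music document, yet it scores zero because it appears in every document.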

There are various models for generating word embeddings.

### 🔹 Word Embeddings with Word2Vec

**Word2Vec** is one of the most influential models for learning **dense vector representations** of words, also known as **embeddings**.

Unlike frequency-based models like TF-IDF, Word2Vec uses a **neural network** to learn word vectors such that **similar words have similar embeddings**.

There are two main architectures:

* **CBOW (Continuous Bag of Words)**: Predicts a word from its context.
* **Skip-gram**: Predicts context words from a target word.

Both approaches rely on the **distributional hypothesis**: words that appear in similar contexts tend to have similar meanings.
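
To see what "predicting context words from a target word" means in terms of training data, the sketch below lists the (target, context) pairs that Skip-gram would learn from for one sentence. The function name `skipgram_pairs` is an assumption, and the sketch omits refinements that real implementations use, such as negative sampling.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["machine", "learning", "is", "fun"]
pairs = skipgram_pairs(sentence, window=1)
for pair in pairs:
    print(pair)
# ('machine', 'learning'), ('learning', 'machine'), ('learning', 'is'),
# ('is', 'learning'), ('is', 'fun'), ('fun', 'is')
```

CBOW uses the same windows but in the other direction: the surrounding words together are the input and the center word is the prediction target.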

### 💻 Code Challenge: Learn Word Embeddings Using Word2Vec

#### 🚀 Your Task:

Write Python code that:

1. Prepares a small corpus of tokenized sentences.
2. Trains a **Word2Vec** model on this corpus using Gensim.
3. Displays the vector representation for a few words.
4. Finds the most similar words to a chosen term.

#### 📚 Hints:

* Use `from gensim.models import Word2Vec`
* Tokenize your corpus as a list of word lists (sentences).
* Try: `model.wv['word']`, `model.wv.most_similar('word')`

**Example Questions to Explore:**

* What is the shape of a word vector?
* Which words are closest to "learning", "data", or "model"?
* Can Word2Vec capture analogies (e.g., "king" - "man" + "woman")?

Try it out and see what your model learns! 🎯

#### 🚀 Solution:

In [2]:
# Word2Vec Code Challenge
from gensim.models import Word2Vec

# 1. Prepare a small corpus of tokenized sentences
corpus = [
    ['machine', 'learning', 'is', 'fun'],
    ['deep', 'learning', 'models', 'are', 'powerful'],
    ['data', 'science', 'uses', 'models'],
    ['word', 'embeddings', 'capture', 'semantics'],
    ['neural', 'networks', 'learn', 'representations'],
    ['model', 'accuracy', 'depends', 'on', 'data'],
    ['king', 'queen', 'man', 'woman'],
    ['python', 'is', 'popular', 'for', 'data', 'science']
]

# 2. Train a Word2Vec model
model = Word2Vec(sentences=corpus, vector_size=20, window=3, min_count=1, sg=1, seed=42)

# 3. Display vector representations for a few words
words_to_show = ['learning', 'data', 'model', 'king', 'queen']
for word in words_to_show:
    print('Vector for :', word)
    print(model.wv[word])
    print('Shape:', model.wv[word].shape)
    print('-'*40)

# 4. Find most similar words to a chosen term
chosen_terms = ['learning', 'data', 'model', 'king']
for term in chosen_terms:
    print(f'Most similar to {term}:')
    print(model.wv.most_similar(term))
    print('-'*40)

# 5. Analogy: king - man + woman
print('Analogy: king - man + woman = ?')
print(model.wv.most_similar(positive=['king', 'woman'], negative=['man']))

Vector for : learning
[ 5.4432148e-05 -1.2916273e-02 -3.1752426e-02  4.2684775e-02
  2.8163580e-02  1.4394796e-02 -9.7623421e-03  3.2266378e-02
  4.5307353e-03 -5.6793080e-03 -4.9702777e-03 -2.7299756e-02
 -4.0785067e-02  5.4709227e-03  3.8802378e-02 -4.3621615e-02
  3.5840165e-02  3.2752793e-02 -2.2315878e-02  1.3174340e-02]
Shape: (20,)
----------------------------------------
Vector for : data
[-0.04108218  0.0274296   0.01548314 -0.00610298 -0.00672426  0.03585006
 -0.0414423   0.01973772 -0.02986624 -0.04060877  0.00266196  0.0475366
  0.0235936   0.02612126  0.02178242  0.02863825  0.00130401 -0.03724516
  0.03397556 -0.00496576]
Shape: (20,)
----------------------------------------
Vector for : model
[ 4.9803518e-02  2.7741602e-02 -1.7956838e-02  4.7178898e-02
 -7.2493975e-05  8.0637117e-05 -6.6362475e-03 -3.5607822e-02
  4.3438919e-02 -4.8612114e-02 -3.2571662e-02 -2.7037278e-02
 -2.2066751e-02 -3.6817949e-02 -4.8198365e-02  1.7767431e-02
  4.6933107e-02 -3.7817143e-02 -2.03442

#### 🧠 What's Really Going On?

When you train a Word2Vec model:

* Each word becomes a vector of real numbers.
* These vectors live in a multi-dimensional space (in your case, 20D).
* The proximity (closeness) between words in that space reflects similarity of meaning or context.

#### 📦 Example: Word Vector Breakdown

Here's one word vector you saw:

**Vector for 'learning':**
**[ 5.44e-05, -0.0129, -0.0317, ..., 0.0131 ]**

* This is a 20-dimensional vector: each number is a "coordinate" of the word "learning" in this abstract space.
* We can't visualize 20D, but you can imagine each vector like an arrow pointing to a specific location in space.
* Similar words will have vectors pointing in similar directions.

**Vector for 'king':**
**[-0.02120683  0.03294286 -0.02494869  0.02968171 -0.00369923 -0.02673593... ]**

* The same reading applies here: each number is a "coordinate" of the word "king" in the same 20-dimensional space.

Now, right after the vector for 'king' we can see:

**[('semantics', 0.36475855112075806), ('depends', 0.3190699517726898), ('science', 0.308027058839798), ('powerful', 0.28583985567092896), ('data', 0.25938111543655396), ('machine', 0.23754160106182098), ('representations', 0.19705672562122345), ('queen', 0.18593956530094147), ('is', 0.1213439330458641), ('man', 0.10342966020107269)]**

This is a list of the top 10 words in your Word2Vec model whose vectors are closest to the word **king**

* `semantics` is the most similar (0.365)
* `queen` is the 8th most similar (0.186)
* `man` is the 10th most similar (0.103)

#### 🎯 Visual Analogy (2D Simplified)

Finally, we see:

**Analogy: king - man + woman = ?**
[('fun', 0.35740363597869873), ('semantics', 0.35546138882637024), ('depends', 0.35479703545570374), ('data', 0.31317758560180664), ('science', 0.20852580666542053), ('powerful', 0.17998428642749786), ('are', 0.1563640832901001), ('machine', 0.10168661922216415), ('python', 0.061376530677080154), ('queen', 0.05523523688316345)]

This is a list of the top 10 words in your Word2Vec model whose vectors are closest to the result of the analogy.

Let’s simplify it to 2D for illustration:

```plaintext

                ▲  "queen"
                |
                |
         "king" ●
                |       ● "woman"
                |     ●
                |   ●  "fun"
                | ●
  --------------●----------------▶
             "man"             (some other word)
```

The vector arithmetic `king - man + woman ≈ ?` returned the following list:

[('fun', 0.35740363597869873), ('semantics', 0.35546138882637024), ('depends', 0.35479703545570374), ('data', 0.31317758560180664), ('science', 0.20852580666542053), ('powerful', 0.17998428642749786), ('are', 0.1563640832901001), ('machine', 0.10168661922216415), ('python', 0.061376530677080154), ('queen', 0.05523523688316345)]

Each tuple is:

* A word from your vocabulary (e.g., 'fun')
* Its cosine similarity score with the resulting vector (e.g., 0.357)
* So 'fun' is the most similar word to the computed vector, but 'queen' is only ranked 10th, with a low similarity of 0.055.

> Why is "queen" not at the top?

* Your corpus is small, so the model doesn't have enough context to truly learn that "king" and "queen" are related.
* On a larger corpus, you'd likely see queen rank first.

#### 🤔 What Does king - man + woman ≈ ? Really Mean?

It’s a semantic equation based on the idea that "king is to man" as "queen is to woman".

In vector form:

**vec("king") - vec("man") + vec("woman") ≈ vec("queen")**


> This directional arithmetic is the magic behind Word2Vec's analogy power.
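
The arithmetic is easier to follow with hand-made vectors. In the sketch below, the 2D embeddings are hypothetical values (axis 0 loosely stands for "royalty", axis 1 for "maleness"), constructed so that the analogy holds exactly; a trained model only approximates this. The three input words are excluded from the candidates, which is common practice for analogy queries.

```python
import math

# Hypothetical 2D embeddings, written by hand for illustration.
toy = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.5, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# vec("king") - vec("man") + vec("woman"), computed coordinate by coordinate
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Rank the remaining words by cosine similarity to the target vector.
candidates = {word: vec for word, vec in toy.items() if word not in {"king", "man", "woman"}}
best = max(candidates, key=lambda word: cosine(target, candidates[word]))
print(best)  # queen
```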

#### 🔍 Why is the word 'king' represented as a 20-dimensional vector? Why not 10 dimensions?

Because we explicitly set the number of dimensions when we trained the Word2Vec model:
```Python
model = Word2Vec(sentences=corpus, vector_size=20, ...)
```

* The vector_size=20 parameter defines the dimensionality of the word embeddings.
* This is not learned from data — it’s a hyperparameter you choose before training.
* You could have chosen vector_size=10, 50, 100, etc.

#### 🤔 How Do You Choose the Right Number?

Smaller vector sizes (like 10 or 20) → faster training, but may lose semantic richness.

Larger sizes (like 100 or 300) → capture more subtle patterns, but need more data and risk overfitting on small corpora.

Common defaults:

* vector_size=100 for small-medium corpora
* vector_size=300 is often used in pretrained embeddings (like Google News)

### 📝 Interpreting the Word2Vec Output

- **Word Vectors:** The printed vectors for words like 'learning', 'data', 'model', 'king', and 'queen' are dense arrays of numbers. Each vector captures semantic properties learned from the context in the corpus.
- **Shape:** The shape of each vector (e.g., 20) matches the 'vector_size' parameter used in training. This is the dimensionality of the embedding space.
- **Most Similar Words:** The 'most_similar' results show which words are closest in meaning to the chosen term, based on their vector proximity. For example, words similar to 'learning' might include 'deep', 'machine', or 'models'.
- **Analogy (king - man + woman):** This classic test checks if the model can capture analogies. If successful, the top result should be 'queen', showing that the model understands gender relationships in the vector space.

Word2Vec embeddings are powerful for capturing semantic relationships, enabling tasks like similarity search, clustering, and analogical reasoning in NLP applications.

### 🧠 Use case: Word2Vec with Wikipedia Text

We now extend our Word2Vec use case by training on real-world text from a **Wikipedia article**. This lets us explore word embeddings on a richer vocabulary and more meaningful context.

### 🔄 Process:

1. **Download**: Load raw text from Wikipedia using `wikipedia` Python library.
2. **Preprocess**: Clean and tokenize sentences.
3. **Train**: Build a Word2Vec model on the processed corpus.
4. **Query**:
   - Similarity between **"learning"** and **"reasoning"**
   - Most similar words to **"perception"**
   - Analogy: **"knowledge" - "human" + "robotics" ≈ ?**
   - Most similar words to **"meta"**

This demonstrates how embeddings capture context and relationships — crucial for tasks like search, question-answering, and even LLMs.

> You may need to install `wikipedia` with `pip install wikipedia`.
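
Before running the full pipeline, it can help to see what the cleaning step does to a single sentence. The sketch below mirrors the notebook's rules (keep letters only, lowercase, drop tokens of two characters or fewer) but uses `str.split` in place of NLTK's `word_tokenize` so that it runs without downloads. The function name and the sample sentence are assumptions.

```python
import re

def clean_and_tokenize(sentence):
    """Keep letters only, lowercase, and drop tokens with 2 or fewer characters."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    return [tok for tok in letters_only.lower().split() if len(tok) > 2]

tokens = clean_and_tokenize("AI was founded as an academic discipline in 1956!")
print(tokens)  # ['was', 'founded', 'academic', 'discipline']
```

Note the side effect of the length filter: short but meaningful tokens such as "AI" are removed along with "as", "an", and "in".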


In [3]:
# Install dependencies (uncomment the line below if not installed)
# !pip install wikipedia nltk gensim

import wikipedia
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from gensim.models import Word2Vec
import re
import os

# 1. Download Wikipedia text
wiki_title = "Artificial intelligence"
# Fetch the content of the Wikipedia page: https://en.wikipedia.org/wiki/Artificial_intelligence
wiki_text = wikipedia.page(wiki_title).content

# 2. Preprocess
# Ensure 'punkt' is available and nltk_data path is set
nltk_data_path = os.path.join(os.getcwd(), 'nltk_data')
print("Downloading 'punkt' tokenizer...")
nltk.download('punkt', download_dir=nltk_data_path, force=True)
print("Downloading 'punkt_tab' tokenizer...")
nltk.download('punkt_tab', download_dir=nltk_data_path, force=True)

# Always append the custom nltk_data path (if not already present)
if nltk_data_path not in nltk.data.path:
    nltk.data.path.append(nltk_data_path)

# Debugging paths and contents
print("NLTK Data Paths:", nltk.data.path)
print("Contents of nltk_data:", os.listdir(nltk_data_path))

# Ensure wiki_text is not empty and properly encoded
if not wiki_text.strip():
    raise ValueError("The Wikipedia page content is empty. Please check the page title or internet connection.")
try:
    wiki_text = wiki_text.encode('utf-8').decode('utf-8')
except UnicodeDecodeError:
    raise ValueError("The Wikipedia page content contains invalid characters and cannot be processed.")

# Split wiki_text into smaller chunks to avoid tokenization issues
text_chunks = [wiki_text[i:i+10000] for i in range(0, len(wiki_text), 10000)]
sentences = []
for chunk in text_chunks:
    sentences.extend(sent_tokenize(chunk))

tokenized_corpus = []

for sentence in sentences:
    sentence = re.sub(r'[^a-zA-Z]', ' ', sentence)
    tokens = word_tokenize(sentence.lower())
    tokens = [word for word in tokens if len(word) > 2]
    tokenized_corpus.append(tokens)

# 3. Train Word2Vec
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=2, sg=1, seed=42)

# 4. Run example queries
print("🔍 Similarity between 'learning' and 'reasoning':")
print(model.wv.similarity('learning', 'reasoning'))

print("\n📚 Most similar to 'perception':")
print(model.wv.most_similar('perception'))

print("\n🏙️ Analogy: knowledge - human + robotics ≈ ?")
print(model.wv.most_similar(positive=['knowledge', 'robotics'], negative=['human']))

print("\n📚 Most similar to 'meta':")
# Query 'meta' in lowercase: preprocessing lowercases the text and strips
# non-alphabetic characters, so 'Meta' is stored in the vocabulary as 'meta'.
# Looking up 'Meta' would raise a KeyError because that key does not exist.
print(model.wv.most_similar('meta'))


Downloading 'punkt' tokenizer...


[nltk_data] Downloading package punkt to c:\StudentWork\Code\PROG8245\
[nltk_data]     IRBasics_VectorSpaceProximity\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


Downloading 'punkt_tab' tokenizer...


[nltk_data] Downloading package punkt_tab to c:\StudentWork\Code\PROG8
[nltk_data]     245\IRBasics_VectorSpaceProximity\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


NLTK Data Paths: ['C:\\Users\\Eespinosa/nltk_data', 'c:\\StudentWork\\Code\\PROG8245\\IRBasics_VectorSpaceProximity\\.venv\\nltk_data', 'c:\\StudentWork\\Code\\PROG8245\\IRBasics_VectorSpaceProximity\\.venv\\share\\nltk_data', 'c:\\StudentWork\\Code\\PROG8245\\IRBasics_VectorSpaceProximity\\.venv\\lib\\nltk_data', 'C:\\Users\\Eespinosa\\AppData\\Roaming\\nltk_data', 'C:\\nltk_data', 'D:\\nltk_data', 'E:\\nltk_data', 'c:\\StudentWork\\Code\\PROG8245\\IRBasics_VectorSpaceProximity\\nltk_data']
Contents of nltk_data: ['tokenizers']
🔍 Similarity between 'learning' and 'reasoning':
0.9976537

📚 Most similar to 'perception':
[('human', 0.9983024001121521), ('google', 0.9980690479278564), ('systems', 0.9980460405349731), ('may', 0.9980018138885498), ('was', 0.9979568719863892), ('has', 0.997938334941864), ('such', 0.9979318380355835), ('trained', 0.9979142546653748), ('some', 0.9979134798049927), ('because', 0.9978792071342468)]

🏙️ Analogy: knowledge - human + robotics ≈ ?
[('real', 0.995150

### 🧠 More Use Cases associated with Vector Space Proximity

- Long term memory for LLMs.
- Semantic Search: based on the meaning or context.
- Similarity search for text, images, audio or video data.
- Ranking and/or recommendation engine.


#### 🧠 Vector Proximity in Recommendation Engines

Recommendation systems use vector representations of user preferences and item characteristics. In collaborative filtering, each item (e.g., a movie, product, song) is embedded in a vector space based on how users have interacted with it. The proximity (often measured using cosine similarity) between items or between users is used to infer what a user might like next.

💡 Key Idea: Items close to those already liked (in vector space) are likely to be good recommendations.
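
A minimal item-to-item sketch of this idea follows. The interaction matrix, the item names, and the `recommend` helper are all hypothetical; each item vector has one entry per user, with 1 meaning that user liked the item.

```python
import math

# Hypothetical interactions: one row per item, one column per user (1 = liked).
item_vectors = {
    "item_a": [1, 1, 0, 1],
    "item_b": [1, 1, 0, 0],
    "item_c": [0, 0, 1, 0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(liked_item):
    """Return the item closest to `liked_item`, plus all similarity scores."""
    scores = {
        other: cosine(item_vectors[liked_item], vec)
        for other, vec in item_vectors.items()
        if other != liked_item
    }
    return max(scores, key=scores.get), scores

best, scores = recommend("item_a")
print(best, round(scores[best], 3))  # item_b 0.816
```

`item_b` is recommended because it was liked by two of the same users as `item_a`, while `item_c` shares no users with it.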

### 🧭 Example – Travel Assistant for Bus Schedules

You're building a travel assistant for a call center at a travel agency. Customers ask natural language questions like:

* "What time does the next bus to Toronto leave?"
* "Is there an evening bus to Toronto?"
* "What time is the first bus to Toronto tomorrow?"

These questions may be phrased differently but carry similar **semantic intent**. We use **vector space proximity** (via word embeddings) to match these queries to predefined responses.

#### 🔧 Other Required Components:

* **Predefined Knowledge Base (KB)**: Text entries describing bus schedules (e.g., `"Next bus to Toronto is at 5:00 PM"`).
* **Embeddings Layer**: Use Word2Vec or Sentence Transformers to represent both queries and KB entries as vectors.
* **Similarity Scoring**: Compute cosine similarity to rank possible answers.
* **Optional Enhancements**: Named entity recognition (NER), datetime parsing, or LLM for context.

---

### 🔢 Code: Train Embeddings & Prepare Responses

In [4]:
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Sample knowledge base: schedule answers
responses = [
    "The next bus to Toronto leaves at 5:00 PM.",
    "There is an evening bus to Toronto at 7:30 PM.",
    "The first bus to Toronto tomorrow departs at 6:15 AM.",
    "No buses are available after 10:00 PM.",
    "Toronto-bound buses run every 2 hours."
]

# Preprocessing and training corpus
corpus = [resp.lower().replace(".", "").split() for resp in responses]
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1, seed=42)

# Get sentence embeddings: average of word vectors
def sentence_vector(sentence):
    tokens = sentence.lower().replace(".", "").split()
    vecs = [model.wv[word] for word in tokens if word in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# Embed all responses
response_vectors = np.array([sentence_vector(r) for r in responses])


### 🧭 Matching User Queries via Cosine Similarity

When a user asks a question like "What time is the next bus to Toronto?", we embed the question and compare it against all stored responses.
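
"Embedding the question" here means averaging the vectors of its known words. The sketch below shows that averaging step in isolation with hypothetical hand-written 2D word vectors instead of a trained model. Unlike the notebook's `sentence_vector`, it also strips question marks before the lookup.

```python
# Hypothetical 2D word vectors, written by hand for illustration.
word_vectors = {
    "evening": [0.0, 1.0],
    "bus":     [1.0, 0.0],
    "toronto": [1.0, 1.0],
}

def sentence_vector(sentence, dims=2):
    """Average the vectors of known words; return a zero vector if none are known."""
    tokens = sentence.lower().replace("?", "").replace(".", "").split()
    known = [word_vectors[tok] for tok in tokens if tok in word_vectors]
    if not known:
        return [0.0] * dims
    return [sum(column) / len(known) for column in zip(*known)]

query_vec = sentence_vector("Evening bus to Toronto?")
print([round(x, 3) for x in query_vec])  # [0.667, 0.667]
print(sentence_vector("hello world"))    # [0.0, 0.0]
```

The zero-vector fallback matters later: a query with no known words has a cosine similarity of 0 with every response.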

In [5]:
# User query (can try different phrasings)
user_query = "Is there an evening bus to Toronto?"

# Embed query and compare to stored response vectors
query_vec = sentence_vector(user_query)
similarities = cosine_similarity([query_vec], response_vectors)[0]

# Rank responses
ranked_indices = similarities.argsort()[::-1]
print("User Query:", user_query)
print("\nBest matches:")
for i in ranked_indices[:3]:
    print(f"- ({similarities[i]:.2f}) {responses[i]}")


User Query: Is there an evening bus to Toronto?

Best matches:
- (0.74) There is an evening bus to Toronto at 7:30 PM.
- (0.34) The first bus to Toronto tomorrow departs at 6:15 AM.
- (0.21) The next bus to Toronto leaves at 5:00 PM.


### 🧭 High-Level System Architecture

Here's how vector space proximity fits in a working travel assistant:

1. **Input Interface**: Accepts speech or text input from user.
2. **Preprocessing Layer**:

   * Tokenization
   * Lowercasing / Cleaning
   * Optional: Named entity recognition (extract city, time)
3. **Embedding Layer**:

   * Vectorize the user query
   * Vectorize all KB entries (if not already done)
4. **Similarity Scoring**:

   * Compute cosine similarity
   * Rank answers based on score
5. **Response Selection**:

   * Return highest-ranked response
   * Optionally add confidence threshold or fallback
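
The last step, a confidence threshold with a fallback, can be sketched as follows. To stay self-contained, this example swaps the Word2Vec embeddings for plain bag-of-words counts. The threshold value `0.3`, the `FALLBACK` message, and the helper names are assumptions that would need tuning in a real system.

```python
import math
from collections import Counter

FALLBACK = "Sorry, I can only answer questions about bus schedules."

kb = [
    "The next bus to Toronto leaves at 5:00 PM.",
    "There is an evening bus to Toronto at 7:30 PM.",
]

def bow(text):
    """Bag-of-words counts after lowercasing and stripping '.' and '?'."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a, b):
    dot = sum(a[term] * b[term] for term in a)  # Counter returns 0 for missing terms
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def respond(query, threshold=0.3):
    """Return the best KB entry, or the fallback if the best score is too low."""
    scores = [cosine(bow(query), bow(entry)) for entry in kb]
    best = max(range(len(kb)), key=scores.__getitem__)
    return kb[best] if scores[best] >= threshold else FALLBACK

print(respond("Is there an evening bus to Toronto?"))
print(respond("What is the weather like?"))
```

The first query returns the evening-bus entry. The second shares only stop words with the knowledge base, so its best score falls below the threshold and the fallback is returned.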

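> 🧠 This method is a first step toward **semantic matching**. In this tiny example the top match is driven largely by shared words such as "evening". With embeddings trained on a much larger corpus, or with sentence-level embeddings, differently worded questions and answers can also land close together in the vector space.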

### 🧭 Extension: Building a Semantic Chatbot for Travel Queries
We now extend our bus schedule assistant into a chatbot that:

* Accepts user input via a loop or function
* Converts questions into vector embeddings
* Matches them with high-proximity knowledge base responses
* Responds intelligently, even if user phrasing varies

This is a rule-based chatbot with semantic matching — no LLM involved yet.

In [6]:
def get_best_response(user_input, response_texts, response_vecs, model):
    def sentence_vector(sentence):
        tokens = sentence.lower().replace(".", "").split()
        vecs = [model.wv[word] for word in tokens if word in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

    user_vec = sentence_vector(user_input)
    scores = cosine_similarity([user_vec], response_vecs)[0]
    top_index = scores.argmax()
    return response_texts[top_index], scores[top_index]


### 🧭 Running the Chatbot in a Loop

We now simulate a basic chatbot interface in a terminal-style loop. The user types a query, and the assistant finds the best-matching predefined response using cosine similarity.

This demonstrates vector space proximity in action in a conversational system.

In [8]:
print("🚌 Travel Assistant Bot (Jupyter Edition)")
print("Ask about bus schedules to Toronto. Type 'exit' to quit.\n")

# Optional: limit to 10 turns for demo purposes
max_turns = 10
turn = 0

while turn < max_turns:
    user_input = input("You: ")
    if user_input.strip().lower() in ["exit", "quit"]:
        print("Bot: Safe travels! 👋")
        break

    reply, score = get_best_response(user_input, responses, response_vectors, model)
    print(f"Bot: {reply} (similarity: {score:.2f})\n")
    
    turn += 1



🚌 Travel Assistant Bot (Jupyter Edition)
Ask about bus schedules to Toronto. Type 'exit' to quit.

Bot: The first bus to Toronto tomorrow departs at 6:15 AM. (similarity: 0.57)

Bot: There is an evening bus to Toronto at 7:30 PM. (similarity: 0.60)

Bot: The next bus to Toronto leaves at 5:00 PM. (similarity: 0.00)

Bot: Safe travels! 👋


### 🧭 Reflection – What We Just Built
✅ We created a semantic chatbot that understands meaning, not just keywords.

🔍 Key Concepts Reinforced:

* Vector Space Proximity (semantic similarity using cosine similarity)
* Word Embeddings (Word2Vec)
* Response ranking via vector similarity
* Stateless rule-based chatbot logic

⚙️ What’s Next:

* Add NER using spaCy to extract locations and times.
* Replace Word2Vec with Sentence Transformers (BERT-style embeddings).
* Wrap the chatbot into a web app using Flask or Streamlit.

### 🧠 LangChain + OpenAI for Travel Assistant

LangChain is a framework for developing applications powered by Large Language Models (LLMs).
It was designed and implemented to be:
- Data-aware: connecting a language model to other sources of data.
- Agentic: allowing a model to interact with its environment.

LangChain allows you to build applications on top of LLMs such as OpenAI's GPT models. In this example, we’re building a simple chatbot that answers bus-related questions using prompt engineering.

We'll use:

* LangChain to manage prompt flow.
* OpenAI as the LLM provider.
* A basic retriever for context (optional in simple cases).

#### 📦 Step 1: Import and Set Up API Key

In [69]:
import os
from dotenv import load_dotenv

# Load environment variables from a .env file
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")
# Check that it's being read correctly
print("API Key loaded:", api_key is not None)

# 🔐 Set your OpenAI API key
# os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")


API Key loaded: True


#### 🤖 Step 2: Design the Prompt Template

Define a system message (behavior) and construct a loop to accept user input. The bot will behave like a travel assistant who only answers bus schedule questions.

In [70]:
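# Note: recent LangChain releases expose these classes as
# `langchain_openai.ChatOpenAI` and `langchain_core.messages.HumanMessage` / `SystemMessage`;
# the import paths below are the older locations.
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage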

# LLM: Set temperature to 0 for more deterministic outputs
chat = ChatOpenAI(temperature=0)

system_prompt = SystemMessage(
    content="You are a helpful travel assistant. Answer only questions about buses from Waterloo to Toronto. Be concise and friendly. If the question is unrelated, politely say you can only answer bus schedule questions."
)


#### 💬 Step 3: Run the Chatbot Loop

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage
import openai
import os
from dotenv import load_dotenv
from IPython.display import display, Markdown

try:
    # 🔐 Load API key from .env file (avoid printing any part of the key)
    load_dotenv()
    api_key = os.getenv("OPENAI_API_KEY")
    print("OpenAI API key found:", api_key is not None)
    client = openai.OpenAI(api_key=api_key)
    print("Initialized OpenAI client.")
except Exception as e:
    print("❌ Failed to initialize OpenAI client:", str(e))

# 🤖 Define system prompt
SYSTEM_PROMPT = (
    "You are a helpful travel assistant. Answer only questions about buses from "
    "Waterloo to Toronto. Be concise and friendly. If the question is unrelated, "
    "politely say you can only answer bus schedule questions."
)

# 💬 Function to send prompt
def query_chatgpt(prompt):
    try:
        print("🤖 Sending prompt to ChatGPT...")
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": prompt}
            ]
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"⚠️ Sorry, something went wrong: {str(e)}"

# 💬 Start chat loop
display(Markdown("### 🚌 Travel Assistant Bot\nAsk about buses from Waterloo to Toronto. Type `exit` to stop."))

while True:
    try:
        user_input = input("You: ")
        if user_input.lower() in ["exit", "quit"]:
            print("Bot: Safe travels! 👋")
            break
        else:
            print(f"\nYou asked: {user_input}")
            print("\n🤖 Bot is thinking...\n")
            reply = query_chatgpt(user_input)
            print(f"Bot: {reply}\n")
    except KeyboardInterrupt:
        print("\nBot: Conversation ended manually. Goodbye!")
        break
    except Exception as e:
        print(f"⚠️ Unexpected error: {str(e)}\nContinuing...\n")


### 🧭 Reflection – Should I Still Use Word2Vec?
✅ We've built a chatbot using Word2Vec and vector proximity, but now let's **extend it using LangChain** and **LLMs (e.g., OpenAI)**.

With LangChain, we gain access to:
- A powerful language model to **understand nuanced questions**
- Memory and chaining capabilities for **context-aware conversation**
- Seamless integration with APIs, tools, and vector databases

🔁 We'll **still use vector embeddings** for semantic filtering, but **LLMs will generate natural responses**.

| Feature         | Word2Vec                         | LLMs (e.g., GPT-4 via LangChain)             |
|----------------|----------------------------------|----------------------------------------------|
| Speed          | ✅ Very fast                     | ❌ Slower due to API latency                 |
| Interpretability| ✅ Vectors are analyzable       | ❌ Harder to trace how answers are formed    |
| Context        | ❌ No awareness of conversation  | ✅ Handles multi-turn dialogue well          |
| Output Fluency | ❌ Returns fixed, predefined text | ✅ Fluent and human-like                     |
| Cost           | ✅ Free (local)                  | ❌ API costs apply (OpenAI)                  |

🧩 Use both together: Word2Vec to filter or retrieve, LLMs to respond.
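
One way to combine the two is to let the vector search pick the relevant schedule entries and then pass them to the LLM as context. The sketch below only builds the message list and makes no API call. The function name `build_messages` and the wording of the prompts are assumptions; the `role`/`content` dictionaries follow the same shape as the `messages` argument used in the chat loop earlier.

```python
def build_messages(system_prompt, retrieved_entries, user_question):
    """Assemble chat messages that ground the LLM in retrieved schedule entries."""
    context = "\n".join(f"- {entry}" for entry in retrieved_entries)
    user_content = f"Schedule facts:\n{context}\n\nQuestion: {user_question}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]

messages = build_messages(
    "You are a helpful travel assistant. Answer using only the schedule facts provided.",
    ["There is an evening bus to Toronto at 7:30 PM."],  # e.g. top match from the vector search
    "Is there an evening bus to Toronto?",
)
print(messages[1]["content"])
```

The resulting list could then be sent to a chat model, which would phrase the answer while staying anchored to the retrieved entry.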