# Responsible AI: XAI GenAI project

## 0. Background



Earlier lessons on explainability covered post-hoc methods for explaining a model, such as saliency maps, SmoothGrad, LRP, LIME, and SHAP. Take LRP (Layer-wise Relevance Propagation) as an example: it highlights the pixels most relevant to the prediction of the class "cat" by backpropagating relevance. (image source: [Montavon et al. (2016)](https://giorgiomorales.github.io/Layer-wise-Relevance-Propagation-in-Pytorch/))

![LRP example](images/catLRP.jpg)

Another example is text sentiment classification. Here we visualize how important each word is for the prediction 'positive':

![text example](images/textGradL2.png)

Words highlighted in darker colours are more important for predicting that the sentence is 'positive' in sentiment.
More examples can be found [here](http://34.160.227.66/?models=sst2-tiny&dataset=sst_dev&hidden_modules=Explanations_Attention&layout=default).

Both cases above require the class or the prediction of the model. But:

***How do you explain a model that does not predict but generates?***

In this project, we explain a generative model through the dependency between words. We first look at a simple example and use Point-wise Mutual Information (PMI) to compute a saliency map of the sentence. After that, we construct the experiment step by step, followed by exercises and questions.


## 1. A simple example to start with
Given a sample sentence: 
> *Tokyo is the capital city of Japan.* 

We are going to explain this sentence by measuring the dependency between words and showing it as a saliency map.
The dependency between two words in the sentence can be measured by [Point-wise mutual information (PMI)](https://en.wikipedia.org/wiki/Pointwise_mutual_information), estimated as follows.


Mask two words out, e.g. 
> \[MASK-1\] is the capital city of \[MASK-2\].


Ask the generative model to fill in the sentence 10 times, and we have:

| MASK-1      | MASK-2 |
| ----------- | ----------- |
|    tokyo   |     japan   |
|  paris  |     france    |
|  london  |     england    |
|  paris  |     france    |
|  beijing |  china |
|    tokyo   |     japan   |
|  paris  |     france    |
|  paris  |     france    |
|  london  |     england    |
|  beijing |  china |

PMI is calculated by: 

$PMI(x,y)=\log_2 \frac{P(\{x,y\} \mid s-\{x,y\})}{P(\{x\} \mid s-\{x,y\})\,P(\{y\} \mid s-\{x,y\})}$

where $x$ and $y$ represent the words that we masked out, $s$ represents the sentence, and $s-\{x,y\}$ represents the tokens of the sentence that remain after removing the words $x$ and $y$.

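In this example, "tokyo" fills MASK-1 in 2 of the 10 responses, "japan" fills MASK-2 in 2 of the 10, and the two appear together in 2 of the 10, so we have $PMI(Tokyo, Japan) = \log_2 \frac{0.2}{0.2 \times 0.2} = \log_2 5 \approx 2.32$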
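
The value 2.32 can be reproduced directly from the ten sampled fills in the table. The snippet below is a minimal sketch that estimates each probability as a relative frequency; the function name `pmi_from_fills` is hypothetical and not part of the project code.

```python
import math

# The ten (MASK-1, MASK-2) fills from the table above.
fills = [
    ("tokyo", "japan"), ("paris", "france"), ("london", "england"),
    ("paris", "france"), ("beijing", "china"), ("tokyo", "japan"),
    ("paris", "france"), ("paris", "france"), ("london", "england"),
    ("beijing", "china"),
]

def pmi_from_fills(fills, x, y):
    """Estimate PMI(x, y) in bits from sampled (first, second) fills."""
    n = len(fills)
    p_x = sum(a == x for a, _ in fills) / n             # how often x fills MASK-1
    p_y = sum(b == y for _, b in fills) / n             # how often y fills MASK-2
    p_xy = sum((a, b) == (x, y) for a, b in fills) / n  # how often both occur together
    if p_xy == 0:
        return float("-inf")  # the pair was never generated together
    return math.log2(p_xy / (p_x * p_y))

print(round(pmi_from_fills(fills, "tokyo", "japan"), 2))  # 2.32
```

A pair that never occurs together, such as ("tokyo", "france"), has a joint frequency of zero, so the sketch returns `-inf` for it.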

Select a word of interest in the sentence; we can now use the generative model to compute the PMI between the chosen word and every other word:
(Here, we use a longer sentence and generate 20 responses per word.)
![](images/resPMI.png)


## 2. Preparation
### 2.1 Conda environment

```
conda env create -f environment.yml
conda activate xai_llm
```


### 2.2 Download the offline LLM

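We use an offline LLM from Hugging Face. The file is approximately 5 GB.
Download it using the command below and save it under `./models/`. Note that `--local-dir .` writes to the current directory, so either run the command from inside `./models/` or move the file there afterwards.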
```
huggingface-cli download TheBloke/openchat-3.5-0106-GGUF openchat-3.5-0106.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
# credit to https://huggingface.co/TheBloke/openchat-3.5-0106-GGUF
```

## 3. Mask the sentence and get the responses from LLM
### 3.1 Get the input sentence

**Remember to change the anchor word index when changing the input sentence.**

In [2]:
# get_input() was removed for consistency, so that every run uses the same sentence.
# def get_input():
#     # ideally this reads inputs from a file; for now it just takes an input
#     return input("Enter a sentence: ")

sentence = "doctors assess symptoms to diagnose diseases"

anchor_word_idx = 0  # index of the word of interest (the anchor)
prompts_per_word = 20  # number of generated responses per masked prompt

# sentence = get_input()
print("Sentence: ", sentence)

Sentence:  doctors assess symptoms to diagnose diseases
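
For orientation, the masked prompts that appear in the progress output of Section 3.3 can be reproduced with a few lines. This is only a sketch: `make_masked_prompts` is a hypothetical name, and the project's own `generate_prompts` helper may be implemented differently.

```python
def make_masked_prompts(sentence, anchor_idx, mask="[MASK]"):
    """Return one prompt per non-anchor word, with that word and the anchor masked."""
    words = sentence.split()
    prompts = []
    for other_idx in range(len(words)):
        if other_idx == anchor_idx:
            continue  # the anchor is masked in every prompt, never on its own
        masked = [mask if i in (anchor_idx, other_idx) else w
                  for i, w in enumerate(words)]
        prompts.append(" ".join(masked))
    return prompts

for prompt in make_masked_prompts("doctors assess symptoms to diagnose diseases", 0):
    print(prompt)
```

A sentence with n words yields n - 1 prompts, one per non-anchor word, which is why the anchor index has to be updated whenever the sentence changes.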


### 3.2 Load the model

In [3]:
from models.ChatModel import ChatModel
model_name = "openchat"
model = ChatModel(model_name)
print(f"Model: {model_name}")

Model: openchat


### 3.3 Run the prompts and get all the responses


In [4]:
from tools.command_generator import generate_prompts, prefix_prompt
from tools.evaluate_response import get_replacements
from tqdm import tqdm

def run_prompts(model, sentence, anchor_idx, prompts_per_word=20):
    prompts = generate_prompts(sentence, anchor_idx)
    all_replacements = []
    for prompt in prompts:
        replacements = []
        for _ in tqdm(
            range(prompts_per_word),
            desc=f"Input: {prompt}",
        ):
            response = model.get_response(
                prefix_prompt(prompt),
            ).strip()
            if response:
                replacement = get_replacements(prompt, response)
                if replacement:
                    replacements.append(replacement)
        if len(replacements) > 0:
            all_replacements.append(replacements)
    return all_replacements

all_responses = run_prompts(model, sentence, anchor_word_idx, prompts_per_word)

Input: [MASK] [MASK] symptoms to diagnose diseases:   0%|          | 0/20 [00:00<?, ?it/s]

Input: [MASK] [MASK] symptoms to diagnose diseases:  20%|██        | 4/20 [00:21<01:10,  4.40s/it]

 Response is not valid. ['[mask]', '[mask]', 'symptoms', 'to', 'diagnose', 'diseases'] ['medical', 'professionals', 'use', '[signs]', 'and', '[indicators]', 'to', 'diagnose', 'diseases']


Input: [MASK] [MASK] symptoms to diagnose diseases: 100%|██████████| 20/20 [00:51<00:00,  2.60s/it]


 Response is not valid. ['[mask]', '[mask]', 'symptoms', 'to', 'diagnose', 'diseases'] ['fever', 'chills', 'and', '[mask]', 'symptoms', 'help', 'diagnose', 'diseases']


Input: [MASK] assess [MASK] to diagnose diseases:  10%|█         | 2/20 [00:04<00:42,  2.34s/it]

 Response is not valid. ['[mask]', 'assess', '[mask]', 'to', 'diagnose', 'diseases'] ['the', 'doctor', 'assesses', 'the', 'patient', 'to', 'diagnose', 'diseases']


Input: [MASK] assess [MASK] to diagnose diseases:  15%|█▌        | 3/20 [00:07<00:41,  2.44s/it]

 Response is not valid. ['[mask]', 'assess', '[mask]', 'to', 'diagnose', 'diseases'] ['the', 'physician', 'assesses', 'medical', 'history', 'and', 'physical', 'examination', 'to', 'diagnose', 'diseases']


Input: [MASK] assess [MASK] to diagnose diseases:  25%|██▌       | 5/20 [00:10<00:29,  1.97s/it]

 Response is not valid. ['[mask]', 'assess', '[mask]', 'to', 'diagnose', 'diseases'] ['the', 'doctor', 'assesses', 'symptoms', 'to', 'diagnose', 'diseases']


Input: [MASK] assess [MASK] to diagnose diseases:  30%|███       | 6/20 [00:12<00:27,  1.95s/it]

 Response is not valid. ['[mask]', 'assess', '[mask]', 'to', 'diagnose', 'diseases'] ['the', 'doctor', 'assesses', 'the', 'patient', 'to', 'diagnose', 'diseases']


Input: [MASK] assess [MASK] to diagnose diseases:  45%|████▌     | 9/20 [00:17<00:19,  1.80s/it]

 Response is not valid. ['[mask]', 'assess', '[mask]', 'to', 'diagnose', 'diseases'] ['the', 'doctor', 'assesses', 'the', 'patient', 'to', 'diagnose', 'diseases']


Input: [MASK] assess [MASK] to diagnose diseases:  50%|█████     | 10/20 [00:19<00:18,  1.84s/it]

 Response is not valid. ['[mask]', 'assess', '[mask]', 'to', 'diagnose', 'diseases'] ['the', 'doctor', 'assesses', 'the', 'patient', 'to', 'diagnose', 'diseases']


Input: [MASK] assess [MASK] to diagnose diseases:  60%|██████    | 12/20 [00:23<00:14,  1.87s/it]

 Response is not valid. ['[mask]', 'assess', '[mask]', 'to', 'diagnose', 'diseases'] ['the', 'doctor', 'assesses', 'symptoms', 'to', 'diagnose', 'diseases']


Input: [MASK] assess [MASK] to diagnose diseases:  65%|██████▌   | 13/20 [00:25<00:12,  1.83s/it]

 Response is not valid. ['[mask]', 'assess', '[mask]', 'to', 'diagnose', 'diseases'] ['the', 'physician', 'assesses', 'patients', 'to', 'diagnose', 'diseases']


Input: [MASK] assess [MASK] to diagnose diseases:  70%|███████   | 14/20 [00:27<00:10,  1.81s/it]

 Response is not valid. ['[mask]', 'assess', '[mask]', 'to', 'diagnose', 'diseases'] ['the', 'doctor', 'assesses', 'symptoms', 'to', 'diagnose', 'diseases']


Input: [MASK] assess [MASK] to diagnose diseases:  95%|█████████▌| 19/20 [00:35<00:01,  1.69s/it]

 Response is not valid. ['[mask]', 'assess', '[mask]', 'to', 'diagnose', 'diseases'] ['the', 'doctor', 'assesses', 'the', 'patient', 'to', 'diagnose', 'diseases']


Input: [MASK] assess [MASK] to diagnose diseases: 100%|██████████| 20/20 [00:37<00:00,  1.86s/it]


 Response is not valid. ['[mask]', 'assess', '[mask]', 'to', 'diagnose', 'diseases'] ['the', 'doctor', 'assesses', 'the', 'patient', 'to', 'diagnose', 'diseases']


Input: [MASK] assess symptoms [MASK] diagnose diseases:   5%|▌         | 1/20 [00:03<00:58,  3.08s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', '[mask]', 'diagnose', 'diseases'] ['the', 'doctor', 'assesses', 'symptoms', 'and', 'diagnoses', 'diseases']


Input: [MASK] assess symptoms [MASK] diagnose diseases:  15%|█▌        | 3/20 [00:07<00:38,  2.26s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', '[mask]', 'diagnose', 'diseases'] ['she', 'assesses', 'symptoms', 'to', '[diagnose]', 'diseases']


Input: [MASK] assess symptoms [MASK] diagnose diseases:  20%|██        | 4/20 [00:08<00:32,  2.00s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', '[mask]', 'diagnose', 'diseases'] ['she', 'assesses', 'symptoms', 'to', 'diagnose', 'diseases']


Input: [MASK] assess symptoms [MASK] diagnose diseases:  25%|██▌       | 5/20 [00:10<00:31,  2.10s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', '[mask]', 'diagnose', 'diseases'] ['the', 'physician', 'assesses', 'symptoms', 'to', '[diagnose]', 'diseases']


Input: [MASK] assess symptoms [MASK] diagnose diseases:  45%|████▌     | 9/20 [00:18<00:21,  1.92s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', '[mask]', 'diagnose', 'diseases'] ['the', 'physician', 'assesses', 'symptoms', 'to', '[mask]', 'diagnose', 'diseases']


Input: [MASK] assess symptoms [MASK] diagnose diseases:  50%|█████     | 10/20 [00:19<00:18,  1.89s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', '[mask]', 'diagnose', 'diseases'] ['the', 'doctor', 'assesses', 'symptoms', 'and', 'diagnoses', 'diseases']


Input: [MASK] assess symptoms [MASK] diagnose diseases:  55%|█████▌    | 11/20 [00:21<00:17,  1.91s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', '[mask]', 'diagnose', 'diseases'] ['the', 'medical', 'professional', 'assesses', 'symptoms', 'to', 'diagnose', 'diseases']


Input: [MASK] assess symptoms [MASK] diagnose diseases:  70%|███████   | 14/20 [00:26<00:10,  1.75s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', '[mask]', 'diagnose', 'diseases'] ['the', 'doctor', 'assesses', 'symptoms', 'and', 'diagnoses', 'diseases']


Input: [MASK] assess symptoms [MASK] diagnose diseases:  75%|███████▌  | 15/20 [00:29<00:10,  2.03s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', '[mask]', 'diagnose', 'diseases'] ['healthcare', 'professionals', 'assess', 'symptoms', 'in', 'order', 'to', '[diagnose]', 'diseases']


Input: [MASK] assess symptoms [MASK] diagnose diseases: 100%|██████████| 20/20 [00:37<00:00,  1.89s/it]
Input: [MASK] assess symptoms to [MASK] diseases:   5%|▌         | 1/20 [00:03<00:58,  3.08s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', 'to', '[mask]', 'diseases'] ['the', 'doctor', 'assesses', 'symptoms', 'to', 'diagnose', 'diseases']


Input: [MASK] assess symptoms to [MASK] diseases:  15%|█▌        | 3/20 [00:06<00:34,  2.03s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', 'to', '[mask]', 'diseases'] ['the', 'physician', 'assesses', 'symptoms', 'to', 'diagnose', 'diseases']


Input: [MASK] assess symptoms to [MASK] diseases:  70%|███████   | 14/20 [00:23<00:09,  1.58s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', 'to', '[mask]', 'diseases'] ['the', 'doctor', 'assesses', 'symptoms', 'to', 'diagnose', 'diseases']


Input: [MASK] assess symptoms to [MASK] diseases: 100%|██████████| 20/20 [00:32<00:00,  1.64s/it]
Input: [MASK] assess symptoms to diagnose [MASK]:   5%|▌         | 1/20 [00:02<00:53,  2.82s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', 'to', 'diagnose', '[mask]'] ['a', 'physician', 'assesses', 'symptoms', 'to', 'diagnose', 'a', 'condition']


Input: [MASK] assess symptoms to diagnose [MASK]:  10%|█         | 2/20 [00:05<00:48,  2.68s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', 'to', 'diagnose', '[mask]'] ['the', 'medical', 'professional', 'assesses', 'symptoms', 'to', 'diagnose', 'a', 'potential', 'illness', 'or', 'condition']


Input: [MASK] assess symptoms to diagnose [MASK]:  15%|█▌        | 3/20 [00:07<00:39,  2.30s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', 'to', 'diagnose', '[mask]'] ['the', 'doctor', 'assesses', 'symptoms', 'to', 'diagnose', 'illness']


Input: [MASK] assess symptoms to diagnose [MASK]:  30%|███       | 6/20 [00:12<00:27,  1.97s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', 'to', 'diagnose', '[mask]'] ['the', 'doctor', 'assesses', 'symptoms', 'to', 'diagnose', 'a', 'potential', 'illness']


Input: [MASK] assess symptoms to diagnose [MASK]:  35%|███▌      | 7/20 [00:14<00:25,  1.95s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', 'to', 'diagnose', '[mask]'] ['the', 'doctor', 'assesses', 'symptoms', 'to', 'diagnose', 'a', 'condition']


Input: [MASK] assess symptoms to diagnose [MASK]:  40%|████      | 8/20 [00:16<00:22,  1.89s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', 'to', 'diagnose', '[mask]'] ['the', 'doctor', 'assesses', 'symptoms', 'to', 'diagnose', 'conditions']


Input: [MASK] assess symptoms to diagnose [MASK]:  45%|████▌     | 9/20 [00:18<00:20,  1.89s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', 'to', 'diagnose', '[mask]'] ['the', 'physician', 'assesses', 'symptoms', 'to', 'diagnose', 'the', 'illness']


Input: [MASK] assess symptoms to diagnose [MASK]:  75%|███████▌  | 15/20 [00:28<00:09,  1.82s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', 'to', 'diagnose', '[mask]'] ['the', 'doctor', 'assesses', 'symptoms', 'to', 'diagnose', 'a', 'condition']


Input: [MASK] assess symptoms to diagnose [MASK]:  90%|█████████ | 18/20 [00:34<00:03,  1.98s/it]

 Response is not valid. ['[mask]', 'assess', 'symptoms', 'to', 'diagnose', '[mask]'] ['the', 'physician', 'assesses', 'symptoms', 'to', 'diagnose', 'the', 'patients', 'illness']


Input: [MASK] assess symptoms to diagnose [MASK]: 100%|██████████| 20/20 [00:37<00:00,  1.89s/it]
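
Each "Response is not valid" line above prints the tokenized prompt next to the tokenized response, which suggests that a response is rejected when it cannot be aligned with the prompt. The sketch below shows a strict version of such a check, under the assumption that every mask is replaced by exactly one word and every unmasked word is reproduced unchanged. `extract_fills` is a hypothetical name; the project's `get_replacements` may apply different, more permissive rules.

```python
def extract_fills(prompt, response, mask="[mask]"):
    """Return the words that fill each mask, or None if the response does not align.

    Strict assumption: one word per mask, unmasked words copied verbatim.
    """
    prompt_tokens = prompt.lower().split()
    response_tokens = response.lower().split()
    if len(prompt_tokens) != len(response_tokens):
        return None  # words were added or dropped
    fills = []
    for p_tok, r_tok in zip(prompt_tokens, response_tokens):
        if p_tok == mask:
            fills.append(r_tok)
        elif p_tok != r_tok:
            return None  # an unmasked word was changed
    return fills

prompt = "[MASK] assess [MASK] to diagnose diseases"
print(extract_fills(prompt, "doctors assess symptoms to diagnose diseases"))
print(extract_fills(prompt, "the doctor assesses the patient to diagnose diseases"))
```

The first call returns `['doctors', 'symptoms']`; the second returns `None` because the response has more words than the prompt, as in several of the rejected responses logged above.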


In [5]:
# visualize responses
all_responses[:1]

[[['flu', '[mask]'],
  ['severe', '[mask] [mask]'],
  ['flulike', '[mask]'],
  ['', ''],
  ['flu', '[mask]'],
  ['common', '[mask]'],
  ['common', '[mask]'],
  ['flu', '[mask]'],
  ['common', '[mask]'],
  ['specific', '[mask]'],
  ['specific', '[mask]'],
  ['fever', 'and [mask]'],
  ['common', '[mask]'],
  ['common', '[mask]'],
  ['common', ''],
  ['flu', '[mask]'],
  ['flu', '[mask]'],
  ['typically', 'doctors [mask] [mask]'],
  ['flu', '[mask]'],
  ['', '']]]
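
Several pairs in the output above contain an empty string or a leftover `[mask]`, so they cannot match any word of the original sentence. One option is to filter such pairs before estimating probabilities. The helper below is a sketch with a hypothetical name (`clean_pairs`); whether to drop these pairs or keep them in the denominator is a design choice that changes the estimated probabilities.

```python
def clean_pairs(pairs, mask="[mask]"):
    """Keep only pairs where both fills are non-empty and contain no leftover mask."""
    kept = []
    for pair in pairs:
        if len(pair) != 2:
            continue
        first, second = (p.strip().lower() for p in pair)
        if not first or not second:
            continue  # the model left a slot empty
        if mask in first or mask in second:
            continue  # the model echoed the mask instead of filling it
        kept.append([first, second])
    return kept

sample = [["flu", "[mask]"], ["", ""], ["common", ""], ["doctors", "assess"]]
print(clean_pairs(sample))  # [['doctors', 'assess']]
```

Dropping pairs shrinks the sample size, so probabilities computed afterwards are conditional on the model having produced a usable response.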

In [6]:
# Load responses
import json
input_file = "responses.json"
with open(input_file, "r") as f:
    all_responses = json.load(f)
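
The cell above reads `responses.json`, but the notebook does not show how that file was written. A minimal sketch of the matching save step follows, assuming the file simply holds the nested list returned by `run_prompts`; the helper names are hypothetical.

```python
import json
import os
import tempfile

def save_responses(all_responses, path="responses.json"):
    """Write the nested list of fills to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(all_responses, f, indent=2)

def load_responses(path="responses.json"):
    """Read the nested list of fills back from disk."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Round trip on a small made-up example, written to a temporary directory.
demo = [[["doctors", "assess"], ["physicians", "evaluate"]]]
demo_path = os.path.join(tempfile.mkdtemp(), "responses.json")
save_responses(demo, demo_path)
restored = load_responses(demo_path)
print(restored == demo)  # True
```

Caching the responses this way avoids re-running the model, which takes roughly two seconds per response in the run shown above.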

### 3.4 EXERCISE: compute the PMI for each word

$PMI(x,y)=\log_2 \frac{P(\{x,y\} \mid s-\{x,y\})}{P(\{x\} \mid s-\{x,y\})\,P(\{y\} \mid s-\{x,y\})}$

* Compute $P(x)$, $P(y)$ and $P(x,y)$ first and print them out.
* Compute the PMI for each word.
* Visualize the result by coloring. Tip: you might need to normalize the result first.


In [7]:
import math
import numpy as np
from termcolor import colored

def compute_pmi(sentence, all_responses, anchor_idx):
    """Compute PMI between anchor word and each other word."""
    words = sentence.lower().split()
    anchor_word = words[anchor_idx]
    pmi_scores = {}
    
    for other_idx in range(len(words)):
        if other_idx == anchor_idx:
            continue
        
        # Get pattern index (skips anchor position)
        pattern_idx = other_idx if other_idx < anchor_idx else other_idx - 1
        if pattern_idx >= len(all_responses):
            continue
            
        responses = all_responses[pattern_idx]
        if not responses:
            continue
        
        # Extract anchor and other word replacements
        anchor_replacements = [r[0].lower() for r in responses if len(r) == 2]
        other_replacements = [r[1].lower() for r in responses if len(r) == 2]
        total = len(anchor_replacements)
        if total == 0:
            # No well-formed pairs for this pattern; avoid dividing by zero
            continue
        
        # Calculate probabilities
        count_x = sum(w == anchor_word for w in anchor_replacements)
        count_y = sum(w == words[other_idx] for w in other_replacements)
        count_xy = sum(anchor_replacements[i] == anchor_word and 
                      other_replacements[i] == words[other_idx] 
                      for i in range(total))
        
        P_x = count_x / total
        P_y = count_y / total
        P_xy = count_xy / total
        
        # Calculate PMI
        if P_x > 0 and P_y > 0 and P_xy > 0:
            pmi = math.log2(P_xy / (P_x * P_y))
        else:
            pmi = float('-inf')
        
        pmi_scores[other_idx] = {'word': words[other_idx], 'pmi': pmi, 
                                  'P_x': P_x, 'P_y': P_y, 'P_xy': P_xy}
    
    return pmi_scores

def visualize_pmi(sentence, pmi_scores, anchor_idx):
    """Visualize PMI with colored words."""
    words = sentence.split()
    
    # Normalize PMI values
    valid_pmis = [s['pmi'] for s in pmi_scores.values() if s['pmi'] != float('-inf')]
    if not valid_pmis:
        print("No valid PMI scores")
        return
    
    min_pmi, max_pmi = min(valid_pmis), max(valid_pmis)
    pmi_range = max_pmi - min_pmi if max_pmi != min_pmi else 1
    
    # Color each word
    colored_words = []
    for i, word in enumerate(words):
        if i == anchor_idx:
            colored_words.append(colored(word, 'cyan', attrs=['bold']))
        elif i in pmi_scores:
            pmi = pmi_scores[i]['pmi']
            if pmi != float('-inf'):
                norm = (pmi - min_pmi) / pmi_range
                color = 'green' if norm > 0.66 else 'yellow' if norm > 0.33 else 'red'
                colored_words.append(colored(f"{word}({pmi:.2f})", color))
            else:
                colored_words.append(word)
        else:
            colored_words.append(word)
    
    print("\n" + " ".join(colored_words) + "\n")

In [8]:
# Compute PMI scores
pmi_scores = compute_pmi(sentence, all_responses, anchor_word_idx)

# Print results
words = sentence.lower().split()
print(f"Anchor word: '{words[anchor_word_idx]}'\n")
for idx in sorted(pmi_scores.keys()):
    data = pmi_scores[idx]
    print(f"{data['word']:<15} PMI={data['pmi']:7.3f}  "
          f"P(x)={data['P_x']:.3f} P(y)={data['P_y']:.3f} P(xy)={data['P_xy']:.3f}")

# Visualize with colors
visualize_pmi(sentence, pmi_scores, anchor_word_idx)

Anchor word: 'doctors'

assess          PMI=   -inf  P(x)=0.000 P(y)=0.000 P(xy)=0.000
symptoms        PMI=  1.737  P(x)=0.300 P(y)=0.200 P(xy)=0.200
to              PMI=  1.152  P(x)=0.100 P(y)=0.450 P(xy)=0.100
diagnose        PMI=  0.322  P(x)=0.100 P(y)=0.800 P(xy)=0.100
diseases        PMI=  0.737  P(x)=0.200 P(y)=0.300 P(xy)=0.100

[1m[36mdoctors[0m assess [32msymptoms(1.74)[0m [33mto(1.15)[0m [31mdiagnose(0.32)[0m [31mdiseases(0.74)[0m



### PMI Results Interpretation (Higher PMI = stronger association)

#### Results for "doctors assess symptoms to diagnose diseases"
- The visualization shows that "symptoms" has the strongest association with "doctors" in this sentence.

**High PMI: Strong Dependency**
- **symptoms (1.74)**: has the highest association with "doctors". When both words are masked, the model generates "doctors" and "symptoms" together more often than their individual frequencies would predict.

**Medium PMI: Moderate Dependency**  
- **to (1.15)**: moderate association.

**Low PMI: Weak Dependency**
- **diseases (0.74)**: predictable from context but not uniquely tied to "doctors".
- **diagnose (0.32)**: the weakest positive association with "doctors", even though the word itself is highly predictable (P(y)=0.80). The model usually restores "diagnose" but pairs it with alternatives to "doctors" (such as physicians or clinicians).

**Undefined PMI: No Evidence of Dependency**
- **assess (-inf)**: the model generated neither "doctors" nor "assess" when both were masked (P(x)=P(y)=0), so the PMI is undefined and the code reports `-inf`.
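
As a sanity check, the PMI values can be recomputed from the probabilities printed in the results. The snippet below is a small sketch that copies those printed values by hand and applies the formula from Section 3.4.

```python
import math

# (P(x), P(y), P(x, y)) copied from the printed results above.
printed = {
    "symptoms": (0.300, 0.200, 0.200),
    "to":       (0.100, 0.450, 0.100),
    "diagnose": (0.100, 0.800, 0.100),
    "diseases": (0.200, 0.300, 0.100),
}

recomputed = {}
for word, (p_x, p_y, p_xy) in printed.items():
    # PMI compares the joint frequency with what independence would predict.
    recomputed[word] = math.log2(p_xy / (p_x * p_y))
    print(f"{word:<10} PMI={recomputed[word]:.3f}")
```

All four values are positive, meaning that each pair was generated together more often than independent fills would predict. With only about 20 samples per pattern, a single additional joint occurrence shifts these estimates noticeably, so small differences between scores should be read with caution.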


## 4. EXERCISE: Try more examples; maybe come up with your own. Report the results.

* Try more examples, change the anchor word or the number of responses, and observe the results. What does the explanation mean? Do you think it is a good explanation? Why or why not?
* What are the limitations of the current method? When does the method fail to explain?

## 5. Bonus Exercises
### 5.1 Language pre-processing. 
In this exercise, we only lowercase the text and split sentences into words on whitespace; there is much more that can be done to pre-process the language, for example handling contractions (*I'll*, *She's*, *world's*), suffixes and prefixes, and compound words (*hard-working*). Splitting text into words is called word tokenization in NLP, and there are Python packages that can do such work for us, e.g. [*TextBlob*](https://textblob.readthedocs.io/en/dev/).
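
To see why whitespace splitting is not enough, it helps to compare it with a slightly smarter tokenizer. The sketch below uses only the standard library; the pattern and the name `simple_tokenize` are illustrative assumptions, not part of the project code, and a dedicated NLP library handles many more cases.

```python
import re

# Runs of letters/digits, optionally followed by an apostrophe suffix such as 's or 'll.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")

def simple_tokenize(text):
    """Lowercase the text and return its word-like tokens without punctuation."""
    return TOKEN_RE.findall(text.lower())

s = "The world's best doctors assess patients' symptoms."
print(s.lower().split())
print(simple_tokenize(s))
```

Whitespace splitting keeps `"patients'"` and `'symptoms.'` with their punctuation attached, so they would never match a generated `patients` or `symptoms`. The regular expression strips that punctuation but still keeps `world's` as one token and splits hyphenated compounds, which is where lemmatizing tokenizers become useful.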

In [None]:
import sys
!{sys.executable} -m spacy download en_core_web_sm

import spacy
from spacy.tokens import Doc
import re

nlp = spacy.load("en_core_web_sm", disable=["parser"])  # keep tagger & lemmatizer

# Define preprocessing function
def preprocess_text(text, keep_stopwords=False, keep_pos=None, remove_punct=True):
    """
    text: single string
    keep_stopwords: if False, remove stopwords
    keep_pos: None or set like {"NOUN","VERB","ADJ"} to filter by POS
    remove_punct: whether to drop punctuation tokens
    returns: list of normalized tokens (lemmas)
    """
    # basic normalization
    text = text.strip()
    # optional: expand contractions (can add contraction library)
    # remove weird whitespace
    text = re.sub(r'\s+', ' ', text)
    doc = nlp(text)
    tokens = []
    for token in doc:
        if remove_punct and token.is_punct:
            continue
        if token.like_num:
            # choose policy: keep numbers or replace with <NUM>
            tokens.append("<NUM>")
            continue
        if not keep_stopwords and token.is_stop:
            continue
        if keep_pos and token.pos_ not in keep_pos:
            continue
        lemma = token.lemma_.lower()
        # strip residual punctuation
        lemma = re.sub(r'^\W+|\W+$', '', lemma)
        if lemma:
            tokens.append(lemma)
    return tokens

# Example
s = "She didn't believe the rumor, yet she felt uneasy."
print(preprocess_text(s, keep_stopwords=False, keep_pos={"NOUN","VERB","ADJ"}))
# Expected output (approx): ['believe', 'rumor', 'feel', 'uneasy']


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
['believe', 'rumor', 'feel', 'uneasy']


### 5.1 Implementation: Advanced Text Preprocessing with spaCy

Comparing simple tokenization vs. advanced preprocessing:

In [None]:
# Install and load spaCy
import sys
import subprocess

# Install spaCy model if not already installed
try:
    import spacy
    nlp = spacy.load("en_core_web_sm")
    print("spaCy model already loaded")
except (ImportError, OSError):  # spaCy is missing, or the model is not downloaded
    print("Downloading spaCy model...")
    subprocess.run([sys.executable, "-m", "pip", "install", "spacy"])
    subprocess.run([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])
    import spacy
    nlp = spacy.load("en_core_web_sm")
    print("spaCy model installed and loaded")

spaCy model already loaded


In [6]:
import spacy
import re

# Load spaCy model (disable parser and NER for speed, keep tagger & lemmatizer)
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def preprocess_text(text, keep_stopwords=False, keep_pos=None, remove_punct=True):
    """
    Advanced text preprocessing using spaCy.
    
    Parameters:
    -----------
    text : str
        Input text to preprocess
    keep_stopwords : bool
        If False, remove stopwords (the, is, a, etc.)
    keep_pos : set or None
        Filter by part-of-speech tags (e.g., {"NOUN", "VERB", "ADJ"})
    remove_punct : bool
        Whether to remove punctuation tokens
    
    Returns:
    --------
    list : Normalized tokens (lemmas)
    """
    # Basic normalization
    text = text.strip()
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)
    
    # Process with spaCy
    doc = nlp(text)
    tokens = []
    
    for token in doc:
        # Skip punctuation if requested
        if remove_punct and token.is_punct:
            continue
        
        # Handle numbers
        if token.like_num:
            tokens.append("<NUM>")
            continue
        
        # Remove stopwords if requested
        if not keep_stopwords and token.is_stop:
            continue
        
        # Filter by POS tag if specified
        if keep_pos and token.pos_ not in keep_pos:
            continue
        
        # Get lemma (base form) and lowercase it
        lemma = token.lemma_.lower()
        
        # Strip any residual punctuation at edges
        lemma = re.sub(r'^\W+|\W+$', '', lemma)
        
        if lemma:
            tokens.append(lemma)
    
    return tokens

# Demonstration examples
print("=" * 70)
print("SIMPLE vs ADVANCED PREPROCESSING COMPARISON")
print("=" * 70)

test_sentences = [
    "She didn't believe the rumor, yet she felt uneasy.",
    "The world's best doctors assess patients' symptoms.",
    "It's a well-known fact that hard-working people succeed.",
    "I'll be there by 5:30 PM on 12/25/2024."
]

for sent in test_sentences:
    print(f"\n Original: {sent}")
    print(f"   Simple:   {sent.lower().split()}")
    print(f"   Advanced: {preprocess_text(sent, keep_stopwords=False)}")
    print(f"   With POS: {preprocess_text(sent, keep_pos={'NOUN', 'VERB', 'ADJ'})}")

SIMPLE vs ADVANCED PREPROCESSING COMPARISON

 Original: She didn't believe the rumor, yet she felt uneasy.
   Simple:   ['she', "didn't", 'believe', 'the', 'rumor,', 'yet', 'she', 'felt', 'uneasy.']
   Advanced: ['believe', 'rumor', 'feel', 'uneasy']
   With POS: ['believe', 'rumor', 'feel', 'uneasy']

 Original: The world's best doctors assess patients' symptoms.
   Simple:   ['the', "world's", 'best', 'doctors', 'assess', "patients'", 'symptoms.']
   Advanced: ['world', 'good', 'doctor', 'assess', 'patient', 'symptom']
   With POS: ['world', 'good', 'doctor', 'assess', 'patient', 'symptom']

 Original: It's a well-known fact that hard-working people succeed.
   Simple:   ["it's", 'a', 'well-known', 'fact', 'that', 'hard-working', 'people', 'succeed.']
   Advanced: ['know', 'fact', 'hard', 'work', 'people', 'succeed']
   With POS: ['know', 'fact', 'work', 'people', 'succeed']

 Original: I'll be there by 5:30 PM on 12/25/2024.
   Simple:   ["i'll", 'be', 'there', 'by', '5:30', 'pm', '

### Key Improvements Demonstrated:

1. **Contractions** (`didn't` $\to$ `did` + `n't`, `I'll` $\to$ `I` + `'ll`; the pieces are stopwords, so they are dropped in the output above)
2. **Possessives** (`patients'` $\to$ `patient`, `world's` $\to$ `world`)
3. **Lemmatization** (`doctors` $\to$ `doctor`, `felt` $\to$ `feel`)
4. **Compound words** (`hard-working` $\to$ separate tokens)
5. **Stopword removal** (removes `the`, `a`, `is`, etc.)
6. **POS filtering** (keep only NOUN/VERB/ADJ)

Now let's apply this to our PMI analysis:

In [3]:
# Enhanced PMI computation with preprocessing
def compute_pmi_enhanced(sentence, all_responses, anchor_idx, use_preprocessing=True):
    """
    Compute PMI with optional advanced preprocessing.
    """
    # Tokenize based on preprocessing choice
    if use_preprocessing:
        words = preprocess_text(sentence, keep_stopwords=True, remove_punct=False)
    else:
        words = sentence.lower().split()
    
    anchor_word = words[anchor_idx]
    pmi_scores = {}
    
    for other_idx in range(len(words)):
        if other_idx == anchor_idx:
            continue
        
        pattern_idx = other_idx if other_idx < anchor_idx else other_idx - 1
        if pattern_idx >= len(all_responses):
            continue
            
        responses = all_responses[pattern_idx]
        if not responses:
            continue
        
        # Process responses with same preprocessing
        anchor_replacements = []
        other_replacements = []
        
        for r in responses:
            if len(r) == 2:
                if use_preprocessing:
                    anchor_tokens = preprocess_text(r[0], keep_stopwords=True, remove_punct=False)
                    other_tokens = preprocess_text(r[1], keep_stopwords=True, remove_punct=False)
                    if anchor_tokens and other_tokens:
                        anchor_replacements.append(anchor_tokens[0])
                        other_replacements.append(other_tokens[0])
                else:
                    anchor_replacements.append(r[0].lower())
                    other_replacements.append(r[1].lower())
        
        if not anchor_replacements:
            continue
            
        total = len(anchor_replacements)
        
        # Calculate probabilities
        count_x = sum(w == anchor_word for w in anchor_replacements)
        count_y = sum(w == words[other_idx] for w in other_replacements)
        count_xy = sum(anchor_replacements[i] == anchor_word and 
                      other_replacements[i] == words[other_idx] 
                      for i in range(total))
        
        P_x = count_x / total if total > 0 else 0
        P_y = count_y / total if total > 0 else 0
        P_xy = count_xy / total if total > 0 else 0
        
        # Calculate PMI
        if P_x > 0 and P_y > 0 and P_xy > 0:
            pmi = math.log2(P_xy / (P_x * P_y))
        else:
            pmi = float('-inf')
        
        pmi_scores[other_idx] = {
            'word': words[other_idx], 
            'pmi': pmi, 
            'P_x': P_x, 
            'P_y': P_y, 
            'P_xy': P_xy
        }
    
    return pmi_scores, words

In [9]:
# Compare: Simple vs Advanced preprocessing
test_sentence = "The doctor's examining patients' symptoms carefully."

print("=" * 70)
print("COMPARISON: Simple vs. Advanced Preprocessing for PMI")
print("=" * 70)
print(f"\nTest sentence: '{test_sentence}'")
print(f"\nSimple tokenization: {test_sentence.lower().split()}")
print(f"Advanced preprocessing: {preprocess_text(test_sentence, keep_stopwords=True, remove_punct=False)}")

print("\n" + "=" * 70)
print("ANALYSIS:")
print("=" * 70)
print("""
Benefits of advanced preprocessing:
1. **Lemmatization**: 'doctor's' → 'doctor', 'patients' → 'patient'
   - Groups inflected forms together for better statistics

2. **Possessive handling**: Splits the 's clitic from its noun
   - 'doctor's' becomes 'doctor' + 's', so 'doctor' can match 'doctors'

3. **Contraction expansion**: 'didn't' → 'did' + 'not'
   - Captures true meaning of negations

4. **Consistent tokenization**: Separates punctuation from words
   - 'carefully.' → 'carefully'; hyphenated compounds are split into parts

This leads to:
More accurate probability estimates (fewer unique tokens)
Better matching between original and generated words
More meaningful PMI scores
""")

COMPARISON: Simple vs. Advanced Preprocessing for PMI

Test sentence: 'The doctor's examining patients' symptoms carefully.'

Simple tokenization: ['the', "doctor's", 'examining', "patients'", 'symptoms', 'carefully.']
Advanced preprocessing: ['the', 'doctor', 's', 'examine', 'patient', 'symptom', 'carefully']

ANALYSIS:

Benefits of advanced preprocessing:
1. **Lemmatization**: 'doctor's' → 'doctor', 'patients' → 'patient'
   - Groups inflected forms together for better statistics

2. **Possessive handling**: Splits the 's clitic from its noun
   - 'doctor's' becomes 'doctor' + 's', so 'doctor' can match 'doctors'

3. **Contraction expansion**: 'didn't' → 'did' + 'not'
   - Captures true meaning of negations

4. **Consistent tokenization**: Separates punctuation from words
   - 'carefully.' → 'carefully'; hyphenated compounds are split into parts

This leads to:
More accurate probability estimates (fewer unique tokens)
Better matching between original and generated words
More meaningful PMI scores



### Exercise 5.1 Tasks: 

In [22]:
# Task 1: Test with your own sentences
print("=" * 80)
print("TASK 1: Testing with Custom Sentences")
print("=" * 80)

# Sentences with contractions, possessives, compound words
custom_sentences = [
    "She didn't believe the rumor, yet she felt uneasy.",
    "John's well-known theory about quantum physics won't be forgotten.",
    "The hard-working scientist's groundbreaking discovery can't be ignored.",
    "It's a state-of-the-art system that doesn't require maintenance.",
]

for i, sent in enumerate(custom_sentences, 1):
    print(f"\nExample {i}: {sent}")
    print(f"   Simple:   {sent.lower().split()}")
    print(f"   Advanced: {preprocess_text(sent, keep_stopwords=False)}")


TASK 1: Testing with Custom Sentences

Example 1: She didn't believe the rumor, yet she felt uneasy.
   Simple:   ['she', "didn't", 'believe', 'the', 'rumor,', 'yet', 'she', 'felt', 'uneasy.']
   Advanced: ['believe', 'rumor', 'feel', 'uneasy']

Example 2: John's well-known theory about quantum physics won't be forgotten.
   Simple:   ["john's", 'well-known', 'theory', 'about', 'quantum', 'physics', "won't", 'be', 'forgotten.']
   Advanced: ['john', 'know', 'theory', 'quantum', 'physics', 'will', 'forget']

Example 3: The hard-working scientist's groundbreaking discovery can't be ignored.
   Simple:   ['the', 'hard-working', "scientist's", 'groundbreaking', 'discovery', "can't", 'be', 'ignored.']
   Advanced: ['hard', 'work', 'scientist', 'groundbreake', 'discovery', 'ignore']

Example 4: It's a state-of-the-art system that doesn't require maintenance.
   Simple:   ["it's", 'a', 'state-of-the-art', 'system', 'that', "doesn't", 'require', 'maintenance.']
   Advanced: ['state', 'art', 's

In [23]:
# Task 2: Experiment with different preprocessing options
print("\n" + "=" * 80)
print("TASK 2: Experimenting with Different Preprocessing Options")
print("=" * 80)

test_text = "The doctor's examining patients' symptoms carefully."

print(f"\nOriginal sentence: {test_text}\n")

# Option A: Keep all stopwords
print("A) With stopwords (keep_stopwords=True):")
print(f"   {preprocess_text(test_text, keep_stopwords=True, remove_punct=False)}")

# Option B: Remove stopwords
print("\nB) Without stopwords (keep_stopwords=False):")
print(f"   {preprocess_text(test_text, keep_stopwords=False, remove_punct=False)}")

# Option C: Only nouns and verbs
print("\nC) Only NOUN + VERB (keep_pos={'NOUN', 'VERB'}):")
print(f"   {preprocess_text(test_text, keep_stopwords=True, keep_pos={'NOUN', 'VERB'})}")

# Option D: Only adjectives and nouns
print("\nD) Only ADJ + NOUN (keep_pos={'ADJ', 'NOUN'}):")
print(f"   {preprocess_text(test_text, keep_stopwords=True, keep_pos={'ADJ', 'NOUN'})}")

# Option E: Keep punctuation
print("\nE) Keep punctuation (remove_punct=False):")
print(f"   {preprocess_text(test_text, keep_stopwords=False, remove_punct=False)}")

# Option F: Remove punctuation
print("\nF) Remove punctuation (remove_punct=True):")
print(f"   {preprocess_text(test_text, keep_stopwords=False, remove_punct=True)}")

print("\n" + "=" * 80)
print("Summary of Options:")
print("=" * 80)
print("""
- keep_stopwords: Controls whether common words (the, is, a) are included
- keep_pos: Filter by part-of-speech (NOUN, VERB, ADJ, ADV, etc.)
- remove_punct: Whether to remove punctuation tokens

Different combinations suit different purposes:
• Full preprocessing: best for semantic analysis
• POS filtering: emphasizes content words
• Keeping stopwords: preserves structure information
""")



TASK 2: Experimenting with Different Preprocessing Options

Original sentence: The doctor's examining patients' symptoms carefully.

A) With stopwords (keep_stopwords=True):
   ['the', 'doctor', 's', 'examine', 'patient', 'symptom', 'carefully']

B) Without stopwords (keep_stopwords=False):
   ['doctor', 'examine', 'patient', 'symptom', 'carefully']

C) Only NOUN + VERB (keep_pos={'NOUN', 'VERB'}):
   ['doctor', 'examine', 'patient', 'symptom']

D) Only ADJ + NOUN (keep_pos={'ADJ', 'NOUN'}):
   ['doctor', 'patient', 'symptom']

E) Keep punctuation (remove_punct=False):
   ['doctor', 'examine', 'patient', 'symptom', 'carefully']

F) Remove punctuation (remove_punct=True):
   ['doctor', 'examine', 'patient', 'symptom', 'carefully']

Summary of Options:

- keep_stopwords: Controls whether common words (the, is, a) are included
- keep_pos: Filter by part-of-speech (NOUN, VERB, ADJ, ADV, etc.)
- remove_punct: Whether to remove punctuation tokens

Different combinations suit different purpo

In [None]:
# Task 3: Compare PMI results with and without preprocessing
print("\n" + "=" * 80)
print("TASK 3: PMI Comparison - With vs Without Preprocessing")
print("=" * 80)

# Load responses if not already loaded
import json
import math
if 'all_responses' not in globals():
    try:
        with open("responses.json", "r") as f:
            all_responses = json.load(f)
        print("Loaded all_responses from responses.json\n")
    except FileNotFoundError:
        print("Error: responses.json not found. Please run the prompts first or ensure the file exists.")
        all_responses = []

# Use a sentence from earlier that we have responses for
comparison_sentence = "doctors assess symptoms to diagnose diseases"
anchor_idx = 0  # "doctors"

print(f"\nTest sentence: '{comparison_sentence}'")
print(f"Anchor word (index {anchor_idx}): '{comparison_sentence.split()[anchor_idx]}'")

# Compute PMI both ways
print("\n" + "-" * 80)
print("WITHOUT Preprocessing (simple tokenization):")
print("-" * 80)
pmi_simple, words_simple = compute_pmi_enhanced(comparison_sentence, all_responses, anchor_idx, use_preprocessing=False)
for idx in sorted(pmi_simple.keys()):
    data = pmi_simple[idx]
    pmi_val = data['pmi'] if data['pmi'] != float('-inf') else "N/A"
    print(f"  {data['word']:<15} PMI={str(pmi_val):>7}  P(x)={data['P_x']:.3f} P(y)={data['P_y']:.3f} P(xy)={data['P_xy']:.3f}")

print("\n" + "-" * 80)
print("WITH Preprocessing (lemmatization, stopword removal, etc.):")
print("-" * 80)
pmi_advanced, words_advanced = compute_pmi_enhanced(comparison_sentence, all_responses, anchor_idx, use_preprocessing=True)
for idx in sorted(pmi_advanced.keys()):
    data = pmi_advanced[idx]
    pmi_val = data['pmi'] if data['pmi'] != float('-inf') else "N/A"
    print(f"  {data['word']:<15} PMI={str(pmi_val):>7}  P(x)={data['P_x']:.3f} P(y)={data['P_y']:.3f} P(xy)={data['P_xy']:.3f}")

print("\n" + "-" * 80)
print("Comparison Summary:")
print("-" * 80)
print(f"Simple tokenization found {len(pmi_simple)} word pairs")
print(f"Advanced preprocessing found {len(pmi_advanced)} word pairs")



TASK 3: PMI Comparison - With vs Without Preprocessing

Test sentence: 'doctors assess symptoms to diagnose diseases'
Anchor word (index 0): 'doctors'

--------------------------------------------------------------------------------
WITHOUT Preprocessing (simple tokenization):
--------------------------------------------------------------------------------
  assess          PMI=    N/A  P(x)=0.000 P(y)=0.000 P(xy)=0.000
  symptoms        PMI=1.7369655941662063  P(x)=0.300 P(y)=0.200 P(xy)=0.200
  to              PMI=1.1520030934450498  P(x)=0.100 P(y)=0.450 P(xy)=0.100
  diagnose        PMI=0.32192809488736207  P(x)=0.100 P(y)=0.800 P(xy)=0.100
  diseases        PMI=0.7369655941662062  P(x)=0.200 P(y)=0.300 P(xy)=0.100

--------------------------------------------------------------------------------
WITH Preprocessing (lemmatization, stopword removal, etc.):
--------------------------------------------------------------------------------
  assess          PMI=    N/A  P(x)=0.000 P(y)=

### Task 4: Analysis and Reflection Questions

**Question 1: How does preprocessing affect the PMI scores?**

**Answer:**
- Preprocessing groups inflected forms (doctor, doctors, doctor's → doctor), so more generated words count as matches and P(x), P(y), and P(x, y) all tend to rise
- PMI can therefore move in either direction: it increases only when the joint probability P(x, y) grows faster than the product P(x) · P(y)
- By reducing vocabulary size (fewer unique tokens), probabilities become less sparse, and fewer pairs can end up with an undefined PMI caused by zero counts
- Words that co-occur in several surface forms are now counted together, which strengthens their association signal

**Question 2: When would preprocessing help PMI analysis?**

**Answer:**
- Better accuracy when words have multiple forms (doctor, doctors, doctor's)
- More reliable statistics by grouping related words together
- Reduced sparsity (fewer unique tokens)
- Better handling of linguistic variations

**Question 3: When might preprocessing hurt or be problematic?**

**Answer:**
- Loss of information when lemmatizing (e.g., "running" and "ran" both → "run")
- Removing negations (not, no, didn't) removes important semantic information
- Stopword removal loses structural context
- Over-aggressive POS filtering might remove important words
- Domain-specific terms might be incorrectly lemmatized
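
One way to soften the negation problem listed above is to remove stopwords but exempt negation words. The sketch below is purely illustrative: `STOPWORDS`, `NEGATIONS`, and `filter_tokens` are hypothetical names with a tiny hand-written word list, and they stand in for neither the notebook's `preprocess_text` nor any library's stopword list. The input tokens are the lemmatized form of "The doctor didn't diagnose the disease correctly."

```python
# Tiny hand-written lists for illustration only (an assumption, not a library list).
STOPWORDS = {"the", "a", "an", "do", "not", "no", "is", "be", "to", "of"}
NEGATIONS = {"not", "no", "never"}

def filter_tokens(tokens, keep_negations=True):
    """Drop stopwords, optionally keeping negation words."""
    kept = []
    for tok in tokens:
        is_protected = keep_negations and tok in NEGATIONS
        if tok in STOPWORDS and not is_protected:
            continue  # ordinary stopword: drop it
        kept.append(tok)
    return kept

tokens = ["the", "doctor", "do", "not", "diagnose", "the", "disease", "correctly"]

print(filter_tokens(tokens, keep_negations=False))
# ['doctor', 'diagnose', 'disease', 'correctly']

print(filter_tokens(tokens, keep_negations=True))
# ['doctor', 'not', 'diagnose', 'disease', 'correctly']
```

Keeping 'not' preserves the difference between "diagnose" and "not diagnose" while still dropping structural words such as 'the'.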

In [35]:
# Task 4: Practical Demonstration
print("=" * 80)
print("TASK 4: Reflection - Effects of Preprocessing")
print("=" * 80)

# Let's create a practical example showing benefits and drawbacks

print("\nBENEFIT EXAMPLE: Handling Inflections")
print("-" * 80)
example1 = "The doctors and the doctor's assistant work together."
print(f"Original: {example1}")
print(f"Simple:   {example1.lower().split()}")
print(f"Advanced: {preprocess_text(example1, keep_stopwords=True)}")
print("\nBenefit: 'doctors', 'doctor's' → all map to 'doctor'")
print("  This groups related forms, improving PMI statistics")

print("\n\nDRAWBACK EXAMPLE: Loss of Negation Information")
print("-" * 80)
example2 = "The doctor didn't diagnose the disease correctly."
print(f"Original: {example2}")
print(f"Simple:   {example2.lower().split()}")
advanced_no_stops = preprocess_text(example2, keep_stopwords=False)
print(f"Advanced (stopwords removed): {advanced_no_stops}")
advanced_keep_stops = preprocess_text(example2, keep_stopwords=True)
print(f"Advanced (stopwords kept):    {advanced_keep_stops}")
print("\nDrawback: Removing 'didn't' loses the negation!")
print("  'didn't diagnose' → 'diagnose' loses semantic meaning")

print("\n\nBENEFIT EXAMPLE: Reducing Sparsity")
print("-" * 80)
forms = ["running", "runs", "run", "runner"]
print(f"Original forms: {', '.join(forms)}")
# Lemmatize each form once; fall back to the raw word if preprocessing returns nothing
lemmatized = [(preprocess_text(word, keep_stopwords=True) or [word])[0] for word in forms]
print(f"Lemmatized:    {lemmatized}")
print("\nBenefit: Forms that share a lemma are counted together, reducing vocabulary size")
print("  Fewer unique tokens = better probability estimates")

print("\n\n" + "=" * 80)
print("TAKEAWAYS:")
print("=" * 80)
print("""
Preprocessing HELPS when:
  1. Handling grammatical variations (plurals, tenses, possessives)
  2. Reducing sparsity (fewer unique tokens for better statistics)
  3. Normalizing text from different sources
  4. Focusing on content words (POS filtering)

Preprocessing HURTS when:
  1. Important semantic information is lost (negations, intensifiers)
  2. Domain-specific terminology is incorrectly normalized
  3. Removing context needed for interpretation
  4. Over-aggressive filtering removes meaningful words

RECOMMENDATION FOR PMI ANALYSIS:
Use selective preprocessing:
  - Keep lemmatization (group related forms)
  - Keep stopwords (preserve structure)
  - Avoid aggressive stopword removal
  - Consider task-specific POS filtering
""")


TASK 4: Reflection - Effects of Preprocessing

BENEFIT EXAMPLE: Handling Inflections
--------------------------------------------------------------------------------
Original: The doctors and the doctor's assistant work together.
Simple:   ['the', 'doctors', 'and', 'the', "doctor's", 'assistant', 'work', 'together.']
Advanced: ['the', 'doctor', 'and', 'the', 'doctor', 's', 'assistant', 'work', 'together']

Benefit: 'doctors', 'doctor's' → all map to 'doctor'
  This groups related forms, improving PMI statistics


DRAWBACK EXAMPLE: Loss of Negation Information
--------------------------------------------------------------------------------
Original: The doctor didn't diagnose the disease correctly.
Simple:   ['the', 'doctor', "didn't", 'diagnose', 'the', 'disease', 'correctly.']
Advanced (stopwords removed): ['doctor', 'diagnose', 'disease', 'correctly']
Advanced (stopwords kept):    ['the', 'doctor', 'do', 'not', 'diagnose', 'the', 'disease', 'correctly']

Drawback: Removing 'didn't' l

### 5.2 Better word matching

In the above example of

> Tokyo is the capital of Japan and a popular metropolis in the world.

When 'metropolis' is masked out, GenAI never returns that exact word; it sometimes returns words like 'city', which differ in form but are close in meaning. Instead of scoring only exact matches (i.e. 0 or 1), we can measure the similarity of the two words, e.g. the cosine similarity of their word embeddings. Cosine similarity ranges from -1 to 1 in general, so it may need to be clipped or rescaled to [0, 1] before it is used as a soft match score.
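
A minimal sketch of this idea follows. It assumes a dictionary of word vectors is available; the 3-dimensional vectors here are made up for illustration and are not real embeddings, and `cosine_similarity`, `soft_match`, and `toy_embeddings` are hypothetical names. Negative similarities are clipped to 0 so that each score stays in [0, 1] and can replace the 0/1 exact-match count.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def soft_match(original, generated, embeddings):
    """Match score in [0, 1]: 1.0 for identical words, clipped cosine similarity otherwise."""
    if original == generated:
        return 1.0
    if original not in embeddings or generated not in embeddings:
        return 0.0  # unknown word: fall back to a hard mismatch
    sim = cosine_similarity(embeddings[original], embeddings[generated])
    return min(1.0, max(0.0, sim))  # clip negative similarities to 0

# Made-up 3-d vectors, NOT real embeddings.
toy_embeddings = {
    "metropolis": [0.9, 0.8, 0.1],
    "city": [0.8, 0.9, 0.2],
    "banana": [0.0, 0.1, 0.9],
}

# Hypothetical replacements generated for the masked word 'metropolis'.
replacements = ["city", "metropolis", "banana", "city"]
scores = [soft_match("metropolis", w, toy_embeddings) for w in replacements]

# Soft probability estimate: mean similarity instead of count / total.
soft_p = sum(scores) / len(scores)
print([round(s, 2) for s in scores], round(soft_p, 2))
```

With exact matching only one of the four replacements would count; with soft matching, 'city' contributes a score close to 1 while 'banana' contributes very little. The same substitution could be applied to the joint count when estimating P(x, y), for instance by multiplying the two per-sample scores, though that is one design choice among several.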