# Words by Numbers: Exploring Vector Similarity 
This notebook accompanies our readings on AI and helps us explore words of interest as "vector embeddings". As we are learning from our readings, large language models are built from calculations about meaningful units of language (words or tokenized word-parts), and they calculate relationships between those units as mathematical vectors.

In this digital humanities orientation to language models and word embeddings, we'll reach for the number values ("vector word embedding values") that language models apply to word-tokens when they read and generate them.

My first version of this notebook is [this Python file](https://github.com/newtfire/textAnalysis-Hub/blob/main/Class-Examples/Python/readFileCollections-examples/readingFileCollection.py), also stored in our textAnalysis-Hub Python collection.

## Pay attention to an interesting word: A little lab test of LLMs
We'll just do this as a test case based on some text file outputs from any generative AI. (You can also adapt this script to explore your own text corpora for your projects.) For our demonstration, we'll run a little experiment on generative AI chatbots (ChatGPT, Claude, Gemini, etc). We can give a couple of these bots the same prompt, and save the outputs in a directory. Then we'll run this notebook over them.

When you review outputs to your prompts, do you see a **word or idea** that's sometimes repeated or that seems an interesting basis for comparison across the files? Take your cue from this: Choose a word of interest that you think is worth "paying attention to" as a basis for comparing the text outputs of the AI (or the texts in your collection). In this notebook, we'll use it as a basis for comparing all the files: **how often do they use a similar word**, that is, a "similar" word based on a language model's calculations of similarity? 

For our experiment, we'll work with spaCy's freely available large English language model (`en_core_web_lg`) to access its database of word vectors and return calculations. Python can "crunch the numbers" to show us which texts use words that sit near each other in the vector space of spaCy's language model. This Python script can help show us how our texts compare to each other based on how often they draw on words that share similar number values.

For this experiment we'll be working with **cosine similarity** values, the kind of vector comparison that LLMs also draw on when calculating a response to a prompt. The only difference is that we're using spaCy's embedding dictionary as our basis for exploring. (Our little experiment simply uses spaCy's vector embeddings as our instrument for measuring similarity. To take this further, we could try to get "under the hood" of each LLM to access its distinctive vector database.)

## Installs and imports
* spacy
* spacy's large language model
* os for handling filepaths


In [1]:
import spacy
import os

# Download the language model ONCE before loading it, for example from a terminal:
#   python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')
# You only need to download the model once, but spacy.load() has to run in every new session.
# If the large one is too big, you can use the medium:
# nlp = spacy.load('en_core_web_md')
# en_core_web_md (size: about 34 MB)
# en_core_web_lg (lots of data here, size: about 400 MB). Note that the LARGEST one will have the most data and probably be the most reliable.


### Identify the directory to read
So, you have a collection of text files that you want to compare, and you've stored them in a directory (folder). 
Which directory do you want the Python script to read? Define it in relation to where you saved your Python file.

In [2]:
collection = 'texts'
# This is a folder saved in the same directory with this Python notebook.

In [3]:
# "Smoke test": Open the directory and make sure we can access the files inside. 
# Here's where we use os.
def readTextFiles(filepath):
    # ebb: the utf8 encoding argument to open() ensures that reading works on everyone's systems.
    with open(filepath, 'r', encoding='utf8') as f:
        readFile = f.read()
        # Uncomment the next line to see the full text of each file printed in the console:
        # print(readFile)
        lengthFile = len(readFile)
        # (Literally, we'll just count the number of characters in this.)
        # This all succeeds if you see a character count printed for each file.
        print(lengthFile)

for file in os.listdir(collection):
    if file.endswith(".txt"):
        filepath = f"{collection}/{file}"
        print(filepath)
        readTextFiles(filepath)

texts/breakfast.txt
129108
texts/sixteen.txt
160864
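
`os.listdir()` returns file names in arbitrary order, so the files may print in a different sequence on another machine. Here is a small sketch of an alternative, assuming you would rather use the standard library's `pathlib` and see the files in alphabetical order. The function name `list_text_files` is made up for this example.

```python
from pathlib import Path

def list_text_files(directory):
    """Return the .txt files in a directory as Path objects, sorted by name."""
    folder = Path(directory)
    if not folder.is_dir():
        return []  # no such folder: nothing to read
    return sorted(folder.glob("*.txt"))

# Assumes a folder named 'texts' next to this notebook, as in the cell above.
for path in list_text_files("texts"):
    text = path.read_text(encoding="utf8")
    print(path.name, len(text))
```

Each `Path` object can be passed straight to `open()`, so the rest of the notebook's functions would work with it unchanged.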


### Apply spaCy's language model 

* First we'll have spaCy tokenize and "read" the files, just using its default tokenizer.
* Then we'll choose our **word of interest** that we'll use as a basis for exploring similarity.
* We'll then create a **dictionary of highly similar values** to that word, drawn from spaCy's vector database.

The similarity value is called "cosine similarity". It ranges from -1 to 1, with 1 being closest to identical; values near 0 or below mean the model sees little in common between the two words.
Try adjusting the "similarity gauge" by tweaking the number in the second `if` statement: 

```py
            if wordOfInterest.similarity(token) > .4:
```

How does changing this value affect the results you see?
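
To make the number behind the gauge concrete, here is a minimal sketch of the cosine similarity calculation itself, written in plain Python. The three-dimensional vectors are made up for illustration (the vectors in spaCy's `en_core_web_lg` have 300 dimensions), and the names `cosine_similarity` and `toy_vectors` are hypothetical, not part of spaCy.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so this sketch reports 0
    return dot / (norm_a * norm_b)

# Made-up vectors for illustration only; these are NOT spaCy's values.
toy_vectors = {
    "panic": [0.9, 0.1, 0.3],
    "fear": [0.8, 0.2, 0.4],
    "breakfast": [-0.2, 0.9, -0.1],
}

threshold = 0.4  # plays the role of the "similarity gauge"
for word, vector in toy_vectors.items():
    score = cosine_similarity(toy_vectors["panic"], vector)
    verdict = "keep" if score > threshold else "skip"
    print(f"{word}: {score:.3f} ({verdict})")
```

With these toy numbers, "fear" points in nearly the same direction as "panic" and passes the gauge, while "breakfast" comes out slightly negative and is skipped.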

 

In [4]:
def readTextFiles(filepath):
    with open(filepath, 'r', encoding='utf8') as f:
        readFile = f.read()
        # print(readFile)
        stringFile = str(readFile)

        tokens = nlp(stringFile)
        # playing with vectors here
        vectors = tokens.vector
        # Let's "see" some vector information by printing the vector for the whole file:
        print(vectors)

for file in os.listdir(collection):
    if file.endswith(".txt"):
        filepath = f"{collection}/{file}"
        readTextFiles(filepath)

[-4.76245359e-02  1.65148228e-01 -1.30200148e-01 -6.64106682e-02
  7.63094574e-02 -2.28473730e-03  2.82227322e-02 -1.21189274e-01
 -2.70061214e-02  1.62962532e+00 -1.58736631e-01  3.13014351e-02
  7.03500435e-02 -7.40286261e-02 -1.44381449e-01 -1.31716784e-02
 -5.05471118e-02  7.91358948e-01 -1.38889983e-01  1.25758920e-03
  5.12803271e-02 -3.47130606e-03 -1.85405444e-02 -4.05540168e-02
  5.12745045e-03  1.28774270e-02 -7.51902759e-02 -3.09352856e-02
  8.77762660e-02 -7.69974962e-02 -3.15146483e-02  1.04475848e-01
 -5.69244958e-02  8.06982890e-02  7.45510459e-02 -3.45903635e-02
  2.38528363e-02  8.77986550e-02 -8.15537274e-02 -3.20069753e-02
  1.21239629e-02  1.82465259e-02 -8.29399843e-03 -7.07079396e-02
  5.02320901e-02  7.79762045e-02 -1.44033179e-01 -3.41775194e-02
  2.57057603e-02  2.46572141e-02 -1.00684064e-02  5.07621877e-02
  1.30882338e-02 -2.85084583e-02  1.76655333e-02  9.58173815e-03
 -1.47783756e-02 -3.90408561e-02  3.27310339e-02 -6.26947954e-02
 -6.78112507e-02 -5.24519
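
The long list of numbers above is a single vector for the whole file. By default, spaCy builds a document's vector by averaging the vectors of its tokens. The sketch below shows that averaging with plain Python and tiny made-up vectors; `average_vector` is a hypothetical helper written for this example, not a spaCy function.

```python
def average_vector(vectors):
    """Return the element-wise mean of a list of equal-length vectors."""
    if not vectors:
        return []
    length = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(length)]

# Made-up 3-dimensional "token vectors" (illustration only).
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]
doc_vector = average_vector(token_vectors)
print(doc_vector)  # [2.0, 2.0, 1.0]
```

Because it is an average, a whole-file vector blurs individual words together, which is one reason the next cell compares tokens one at a time instead.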

In [5]:
def readTextFiles(filepath):
    with open(filepath, 'r', encoding='utf8') as f:
        readFile = f.read()
        # print(readFile)
        stringFile = str(readFile)

        tokens = nlp(stringFile)
        # playing with vectors here
        vectors = tokens.vector
        # Want to "see" some vector information? Uncomment the  next line:
        # print(vectors)

        wordOfInterest = nlp(u'panic')

        # Now, let's open an empty dictionary! We'll fill it up with the for loop just after it.
        # The for-loop goes over each token and gets its similarity to our word of interest.
        highSimilarityDict = {}
        for token in tokens:
            # Only look at tokens that have a vector (a vector_norm of 0 means no vector).
            if token and token.vector_norm:
                similarity = wordOfInterest.similarity(token)
                if similarity > .4:
                # ^^^^ our "similarity gauge" ^^^^
                    highSimilarityDict[token] = similarity
                    # The line above creates each entry in the dictionary: token as key, similarity as value.
                    # print(token.text, "about this much similar to", wordOfInterest, ": ", similarity)
        print(f'This is a dictionary of words most similar to the word "{wordOfInterest.text}" in "{os.path.basename(filepath)}".')
        print(highSimilarityDict)
        print('\n')

        

for file in os.listdir(collection):
    if file.endswith(".txt"):
        filepath = f"{collection}/{file}"
        readTextFiles(filepath)

This is a dictionary of words most similar to the word "panic" in "breakfast.txt".
{crazy: 0.4205722212791443, brain: 0.41149720549583435, sort: 0.41327935457229614, upset: 0.4269835650920868, angry: 0.44234710931777954, stop: 0.4402228891849518, confusion: 0.5925239324569702, mistake: 0.4031905233860016, sleep: 0.5002427697181702, when: 0.43749767541885376, mess: 0.40713033080101013, utter: 0.4334380626678467, confusion: 0.5925239324569702, embarrassed: 0.40131354331970215, uncomfortable: 0.4029877781867981, ignore: 0.4248170852661133, happening: 0.40955767035484314, when: 0.43749767541885376, angry: 0.44234710931777954, temper: 0.42683684825897217, nervous: 0.5641590356826782, pain: 0.5249786972999573, pain: 0.5249786972999573, ignore: 0.4248170852661133, ignore: 0.4248170852661133, upset: 0.4269835650920868, moment: 0.44734853506088257, emotions: 0.493652880191803, hurt: 0.41069528460502625, afraid: 0.49458327889442444, afraid: 0.49458327889442444, situation: 0.47141218185424805, fe

#### "Dedupe" and sort the dictionary
The results above repeat a word every time it appears in the text. That's information we might want to count, while still having each word appear only once. We might also want to sort the values from most to least similar to our word of interest.

We'll start on this in the next cell. The **Counter** class from the collections library will be super helpful here because it takes distinct values AND counts how often each value appears in the text. 
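
As a quick illustration of what `Counter` does, here is a minimal sketch with a made-up list of words standing in for the tokens that pass the similarity gauge:

```python
from collections import Counter

# Made-up words standing in for tokens that passed the similarity gauge.
similar_words = ["pain", "afraid", "pain", "nervous", "afraid", "pain"]

word_counts = Counter(similar_words)
print(word_counts["pain"])         # 3
print(len(word_counts))            # 3 distinct words
print(word_counts.most_common(1))  # [('pain', 3)]
```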


In [6]:
from collections import Counter

def readTextFiles(filepath):
    with open(filepath, 'r', encoding='utf8') as f:
        readFile = f.read()
        # print(readFile)
        stringFile = str(readFile)

        tokens = nlp(stringFile)
        # playing with vectors here
        vectors = tokens.vector
        # Want to "see" some vector information? Uncomment the  next line:
        # print(vectors)

        wordOfInterest = nlp(u'school')

        # Our structures for storing similarity data:
        # a dictionary of similarity values, plus a Counter to tally how often each word appears.
        # (To sort by similarity, call sorted() on the dictionary's items AFTER the loop has filled it.)
        highSimilarityDict = {}
        wordCounts = Counter()

        for token in tokens:
            # Only look at tokens that have a vector (a vector_norm of 0 means no vector).
            if token and token.vector_norm:
                similarity = wordOfInterest.similarity(token)
                if similarity > .4:
                # ^^^^ our "similarity gauge" ^^^^
                    wordCounts[token.text] += 1
                    highSimilarityDict[token] = similarity
                    # The line above creates each entry in the dictionary: token as key, similarity as value.
                    # print(token.text, "about this much similar to", wordOfInterest, ": ", similarity)
        # Smoke test for the Counter(): 
        # for word, count in list(wordCounts.items())[:5]:
        #    print(f"'{word}': {count}")
        print(f'This is a dictionary of words most similar to the word "{wordOfInterest.text}" in "{os.path.basename(filepath)}".')
        for word, similarity in highSimilarityDict.items():
            count = wordCounts[word.text]
            print(f"{word}: similarity={similarity:.3f}, count={count}")
            # The `:.3f` above rounds the similarity to three decimal places (the `f` means fixed-point float formatting).
            # print(f"{word}: similarity={similarity}, count={count}")
        print('\n')
 

for file in os.listdir(collection):
    if file.endswith(".txt"):
        filepath = f"{collection}/{file}"
        readTextFiles(filepath)


This is a dictionary of words most similar to the word "school" in "breakfast.txt".
Club: similarity=0.431, count=3
children: similarity=0.534, count=3
they: similarity=0.410, count=23
They: similarity=0.410, count=19
they: similarity=0.410, count=23
going: similarity=0.450, count=15
HIGH: similarity=0.542, count=1
SCHOOL: similarity=1.000, count=1
DAY: similarity=0.440, count=43
school: similarity=1.000, count=18
High: similarity=0.542, count=1
School: similarity=1.000, count=2
But: similarity=0.405, count=7
who: similarity=0.426, count=12
care: similarity=0.414, count=9
way: similarity=0.416, count=16
DAY: similarity=0.440, count=43
her: similarity=0.420, count=65
up: similarity=0.423, count=83
class: similarity=0.534, count=1
go: similarity=0.440, count=36
day: similarity=0.440, count=4
her: similarity=0.420, count=65
up: similarity=0.423, count=83
school: similarity=1.000, count=18
DAY: similarity=0.440, count=43
there: similarity=0.425, count=33
He: similarity=0.426, count=59
firs
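
The output above repeats a word each time it occurs and follows the order of the text, because every spaCy token is a separate dictionary key. One possible way to list each word once, sorted from most to least similar, is to key the dictionary by the token's text instead. The sketch below assumes you have already collected `(word, similarity)` pairs; the sample pairs are copied from the output above, and `summarize_similarity` is a hypothetical helper name.

```python
from collections import Counter

def summarize_similarity(pairs):
    """Collapse repeated words and sort by similarity, highest first.

    pairs: iterable of (word_text, similarity) tuples, one per token.
    Returns a list of (word_text, similarity, count) tuples.
    """
    counts = Counter()
    similarities = {}
    for word, similarity in pairs:
        counts[word] += 1
        # With static word vectors, a repeated word gets the same value each time.
        similarities[word] = similarity
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [(word, similarity, counts[word]) for word, similarity in ranked]

# A few (word, similarity) pairs taken from the output above.
# The counts here reflect only this small sample, not the whole file.
sample_pairs = [
    ("children", 0.534), ("they", 0.410), ("they", 0.410),
    ("going", 0.450), ("school", 1.000), ("school", 1.000),
]
for word, similarity, count in summarize_similarity(sample_pairs):
    print(f"{word}: similarity={similarity:.3f}, count={count}")
```

Inside the notebook's loop, the pairs could be built as `(token.text, wordOfInterest.similarity(token))`. Using `token.text.lower()` instead would count `School` and `school` together.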

## Your Turn! 

Adapt this code in your own Python file or notebook to explore similarity values based on your own collection of text files. Is this of interest in your project?