# NLP similarity and methods - first exploration of the vector space. 

## Content: 
- Readme
- Setup, tests
- Comprehensions
- On the numerical representation of natural language
- Bag of words
- [Dot product](#dot-product)
- Euclidean distance
- Length normalization
- Cosine similarity
- TF-IDF
- Mini project: Finding the most similar document using cosine similarity and TF-IDF
- About, credits, where to learn more, and so on. 

In [13]:
from IPython.display import display, HTML

display(HTML('''
<style>
  .MathJax_Display, .MathJax {
    font-size: 250% !important;
  }
</style>
'''))

# thanks to chatgpt for the css 


# Readme

This is the first notebook in what I hope will become a series covering the curriculum of IN2110, a course at UiO.

In this notebook, we'll go through some basic concepts and algorithms, and then end with a small final project that demonstrates them.


# Setup and tests

## Requirements.txt

anyio==4.9.0
argon2-cffi==25.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==3.0.0
async-lru==2.0.5
attrs==25.3.0
babel==2.17.0
beautifulsoup4==4.13.4
bleach==6.2.0
certifi==2025.4.26
cffi==1.17.1
charset-normalizer==3.4.2
click==8.2.1
comm==0.2.2
contourpy==1.3.2
cycler==0.12.1
debugpy==1.8.14
decorator==5.2.1
defusedxml==0.7.1
executing==2.2.0
fastjsonschema==2.21.1
fonttools==4.58.1
fqdn==1.5.1
h11==0.16.0
httpcore==1.0.9
httpx==0.28.1
idna==3.10
iniconfig==2.1.0
ipykernel==6.29.5
ipython==9.3.0
ipython_pygments_lexers==1.1.1
isoduration==20.11.0
jedi==0.19.2
Jinja2==3.1.6
joblib==1.5.1
json5==0.12.0
jsonpointer==3.0.0
jsonschema==4.24.0
jsonschema-specifications==2025.4.1
jupyter-events==0.12.0
jupyter-lsp==2.2.5
jupyter_client==8.6.3
jupyter_core==5.8.1
jupyter_server==2.16.0
jupyter_server_terminals==0.5.3
jupyterlab==4.4.3
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
kiwisolver==1.4.8
MarkupSafe==3.0.2
matplotlib==3.10.3
matplotlib-inline==0.1.7
mistune==3.1.3
nbclient==0.10.2
nbconvert==7.16.6
nbformat==5.10.4
nest-asyncio==1.6.0
nltk==3.9.1
notebook==7.4.3
notebook_shim==0.2.4
numpy==2.2.6
overrides==7.7.0
packaging==25.0
pandocfilters==1.5.1
parso==0.8.4
pexpect==4.9.0
pillow==11.2.1
platformdirs==4.3.8
pluggy==1.6.0
prometheus_client==0.22.1
prompt_toolkit==3.0.51
psutil==7.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pycparser==2.22
Pygments==2.19.1
pyparsing==3.2.3
pytest==8.3.5
python-dateutil==2.9.0.post0
python-json-logger==3.3.0
PyYAML==6.0.2
pyzmq==26.4.0
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.25.1
scikit-learn==1.7.0
scipy==1.15.3
Send2Trash==1.8.3
setuptools==80.9.0
six==1.17.0
sniffio==1.3.1
soupsieve==2.7
stack-data==0.6.3
terminado==0.18.1
threadpoolctl==3.6.0
tinycss2==1.4.0
tornado==6.5.1
tqdm==4.67.1
traitlets==5.14.3
types-python-dateutil==2.9.0.20250516
typing_extensions==4.14.0
uri-template==1.3.0
urllib3==2.4.0
wcwidth==0.2.13
webcolors==24.11.1
webencodings==0.5.1
websocket-client==1.8.0
wordcloud==1.9.4

## How to run
## Folder structure

# Comprehensions


# On the numerical representation of natural language, and what is a vector anyways? 


# Bag of words

# Dot product

The formula for the dot product is the following: 

$a \cdot b =\sum_{i=0}^{n - 1}(a_ib_i)$


In other words: for every index i, multiply feature i of vector a by feature i of vector b, then add up all of those products. The index runs from 0 (the first feature) to n - 1 (the last feature), where n is the number of features in each vector. The formula assumes that the two vectors have the same length.

Example: 

a = [1, 2, 3]

b = [2, 2, 2]

dot product = ((1 * 2) + (2 * 2) + (3 * 2) ) = 12

Example 2: 

a = [1, 1, 7]

b = [2, 3, 6]

dot product = ((1 * 2) + (1 * 3) + (7 * 6) ) = 47


Example 3: 

a = [0]

b = [1]

dot product = 0 * 1 = 0


Here is an implementation in Python, meant to be readable. 


In [9]:
def dot_product(vector1: list[float], vector2: list[float]) -> float:
    """
    Compute the dot product of two vectors.

    Args:
        vector1 (list[float]): a list representing a vector.
        vector2 (list[float]): a list representing a second vector of the same length.

    Returns:
        float: the sum of the element-wise products.

    Raises:
        ValueError: If the vectors are not of the same length.
    """

    # zip() silently stops at the shorter input, so check the lengths first.
    if len(vector1) != len(vector2):
        raise ValueError("Vectors must be of the same length")

    total = 0
    for v1, v2 in zip(vector1, vector2):
        total += v1 * v2
    return total

print(f"Dot product, [1, 2, 3], [2, 2, 2] = {dot_product([1, 2, 3], [2, 2, 2])}" )
print(f"Dot product, [1, 1, 7], [2, 3, 6] = {dot_product([1, 1, 7], [2, 3, 6])}" )
print(f"Dot product, [0], [1] = {dot_product([0], [1])}" )






Dot product, [1, 2, 3], [2, 2, 2] = 12
Dot product, [1, 1, 7], [2, 3, 6] = 47
Dot product, [0], [1] = 0
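
As a sanity check, the three examples above can be compared against NumPy, which is already listed in the requirements. This is only a sketch: `dot_product_loop` is a hypothetical, shortened restatement of the loop-based function so that the cell runs on its own, and `np.dot` is assumed to be a fair reference for one-dimensional vectors.

```python
import numpy as np


def dot_product_loop(a, b):
    """Hypothetical compact version of the loop-based dot product above."""
    if len(a) != len(b):
        raise ValueError("Vectors must be of the same length")
    return sum(x * y for x, y in zip(a, b))


# The same three example pairs used earlier in this section.
examples = [([1, 2, 3], [2, 2, 2]), ([1, 1, 7], [2, 3, 6]), ([0], [1])]

results = []
for a, b in examples:
    loop_result = dot_product_loop(a, b)
    numpy_result = int(np.dot(a, b))  # np.dot returns a NumPy scalar for 1-D input
    results.append((loop_result, numpy_result))
    print(f"{a} . {b}: loop = {loop_result}, numpy = {numpy_result}")

# Why the length check matters: zip() stops at the shorter list,
# so without the check the last feature of the longer vector is ignored.
truncated = sum(x * y for x, y in zip([1, 2, 3], [4, 5]))
print(f"Unchecked zip on [1, 2, 3] and [4, 5] gives {truncated}")
```

If both columns agree for every pair, the hand-written loop and the library call compute the same thing. The last line shows what would happen without the length check: the result looks plausible, but the third feature was silently dropped.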


# Euclidean distance

# Length normalization

# Cosine similarity  

# TF-IDF

# Mini project

# About, credits and so on. 