# NLP similarity and methods - first exploration of the vector space. 

## Content: 
- Readme
- Setup, tests
- Tokenization
- Comprehensions
- On the numerical representation of natural language
- Bag of words
- [Dot product](#dot-product)
- Euclidean distance
- Length normalization
- Cosine similarity
- TF-IDF
- Mini project: Finding the most similar document using cosine similarity and TF-IDF
- Mini project revisited: doing the same, just with professional libraries.
- About, credits, where to learn more, and so on. 

In [None]:
# Run this cell to make the math formulas larger.

from IPython.display import display, HTML

display(HTML('''
<style>
  .MathJax_Display, .MathJax {
    font-size: 250% !important;
  }
</style>
'''))

# thanks to chatgpt for the css 


# Readme

This is the first notebook in what I hope will be a series of notebooks covering the curriculum of IN2110, a course at UiO.

In this notebook, we'll go through some basic concepts and algorithms, and end with a small final project that demonstrates them.

The final project comes in two versions: one using my own implementations and one using more professional libraries. This teaches me both how to implement the algorithms myself and how to properly use the standard libraries for these kinds of tasks.


# Setup and tests

## Requirements.txt
See requirements.txt

## Folder structure
TODO 

## How to run

- Create venv (recommended)

```bash
python -m venv NLP-venv
source ./NLP-venv/bin/activate
pip install -r requirements.txt

```
Install Jupyter Notebook: 
Follow instructions here: <link>

Run `jupyter notebook NLP-notebook.ipynb` from the root folder.


# Tokenization

## What is a token? 
- How many tokens is "New York" or "Celine Dion"? 
- Are punctuation marks tokens, or part of tokens? 
- Stop words 


## Tokens, types, lemmas 

## Ways of tokenizing

- split()
- using re
- using nltk
- writing your own: a bit hard. 
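As a rough illustration of the first two approaches in the list above, here is a minimal sketch using a made-up sentence. The regular expression is just one possible pattern, chosen for illustration; `nltk.tokenize.word_tokenize`, which is used later in this notebook, is left out here because it needs the NLTK data download.

```python
import re

# A made-up example sentence (not from the course material).
text = "Celine Dion sang in New York, didn't she?"

# 1) str.split(): splits on whitespace only, so punctuation stays attached to words.
whitespace_tokens = text.split()

# 2) re.findall(): runs of word characters (\w+) and single punctuation marks
#    become separate tokens. Note how this splits "didn't" into three tokens.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
# ['Celine', 'Dion', 'sang', 'in', 'New', 'York,', "didn't", 'she?']
print(regex_tokens)
# ['Celine', 'Dion', 'sang', 'in', 'New', 'York', ',', 'didn', "'", 't', 'she', '?']
```

Neither approach treats "New York" or "Celine Dion" as a single token, which is part of why the question of what counts as a token is not trivial.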



# Comprehensions

Comprehensions are a concise, Pythonic way to create lists, sets, and dictionaries; the same syntax with parentheses creates a generator.

The basic syntax is `[expression for variable in iterable if condition]`.


While comprehensions aren't really part of NLP in themselves, they are so commonly used, both in my code and in others', that I think they should be included in this notebook.

Example: 


In [14]:
names = ["Bob", "Lars", "Celine"]

# lower only names beginning with B, include all names. 
names = [name.lower() if name[0] == "B" else name for name in names]
print(names)




['bob', 'Lars', 'Celine']


In [6]:
# Flip a dictionary: 

phone_book = {

    "Bob": 12345678,
    "Lars" : 22222222, 
    "Celine" : 33333333
    
}

number_to_name_dict = {number : name for name, number in phone_book.items()}

print(number_to_name_dict)

# notice that there is a possibility of losing data here, if two names are linked to the same number. 

{12345678: 'Bob', 22222222: 'Lars', 33333333: 'Celine'}


In [8]:
# Create a set from a list: 

fruit_list = ["apple", "apple", "banana"]

fruit_set = {fruit for fruit in fruit_list}
print(fruit_set)

# notice the different brackets and how they affect the type of the comprehension. 

{'apple', 'banana'}


In [13]:
# You can also do nested comprehensions, for instance to flatten a matrix. 

matrix = [[1, 2, 3],
          [2, 3, 4], 
          [3, 4, 5]]

flat_matrix = [v for vector in matrix for v in vector]
print(f"flat matrix: {flat_matrix}")

# or again, to return a set from a matrix: 

flat_set = {v for vector in matrix for v in vector}
print(f"flat set: {flat_set}")


flat matrix: [1, 2, 3, 2, 3, 4, 3, 4, 5]
flat set: {1, 2, 3, 4, 5}


In [None]:
import time 
import sys 
# Generators 

# Generators are iterables that yield objects one at a time, when they are needed. I.e., they are a way of avoiding storing a large 
# iterable in memory, and instead loading only as much as you need when you need it. 
# You've probably already relied on similar lazy behaviour when looping over a file object line by line, or over range. 

# The syntax is quite similar to the other comprehensions, though the expression is enclosed in () rather than [] or {}

# Example: 

def square_10million_list_comprehension(): 
    """ Method for squaring  up to 10⁷ using a list comprehension. """
    start_time = time.time() 
    squares = [i**2 for i in range(10**7)] 
    end_time = time.time()
    elapsed_time = end_time - start_time
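    # Note: getsizeof reports only the list object itself (its array of references),
    # not the integer objects it refers to, so the real footprint is larger.
    memory_usage = sys.getsizeof(squares)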
    print("List comprehension:")
    print(f"Elapsed time: {elapsed_time:.2f} seconds")
    print(f"Memory usage: {memory_usage:,} bytes (~{memory_usage / (1024**2):.2f} MB)")
 

# uncomment the next line to run the list comprehension
#square_10million_list_comprehension() 

# here is a generator version: 

def square_10million_generator(): 
    """ Method for squaring up to 10⁷ using a generator comprehension. """ 
    start_time = time.time() 
    squares = (i**2 for i in range(10**7)) 
    end_time = time.time()
    print("Generator")
    elapsed_time = end_time - start_time
    memory_usage = sys.getsizeof(squares)
    print(f"Elapsed time: {elapsed_time:.2f} seconds")
    print(f"Memory usage: {memory_usage:,} bytes (~{memory_usage / (1024**2):.2f} MB)")
    return squares

squares = square_10million_generator() 
 
# The generator has now been created, though nothing has yet been computed. 


# using the generator, getting the n first squares: 

n = 15

for _ in range(n): 
    print(next(squares))


# A final point is that generators can be exhausted: 

new_generator = (i for i in range(3)) 

n = 3 
# Uncomment the line below to ask the generator to yield more even after it's exhausted. 
# n = 4
for i in range(n): 
    print(next(new_generator)) 

# ChatGPT helped me think more clearly about generators and provided the syntax for getting something out of a generator. 
# It also helped me find a good way to measure how much is loaded into memory. 

# If this is a bit fuzzy, then hopefully it will become clearer for both you and me as we use generators in real problems further down the line. 





List comprehension:
Elapsed time: 0.66 seconds
Memory usage: 89,095,160 bytes (~84.97 MB)
Generator
Elapsed time: 0.00 seconds
Memory usage: 200 bytes (~0.00 MB)
0
1
4
9
16
25
36
49
64
81
100
121
144
169
196
0
1
2


StopIteration: 
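The `StopIteration` above is what `next()` raises when a generator is exhausted. Here is a minimal sketch of two common ways to avoid the exception: looping over the generator, and passing a default value to `next()`. The variable names are only for illustration.

```python
small_generator = (i**2 for i in range(3))

# A for loop (or a comprehension) stops cleanly when the generator is exhausted.
collected = [value for value in small_generator]

# next() accepts a second argument that is returned instead of raising StopIteration.
leftover = next(small_generator, None)

print(collected)  # [0, 1, 4]
print(leftover)   # None
```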

# On the numerical representation of natural language, and what is a vector anyway? 


# Bag of words

In [None]:
from collections import Counter
from nltk.tokenize import word_tokenize
import nltk
import re
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
from wordcloud import WordCloud

nltk.download("punkt")
nltk.download("stopwords")

# Use a distinct name so the imported `stopwords` module is not shadowed.
norwegian_stopwords = set(stopwords.words("norwegian"))
#print(norwegian_stopwords)


def main():
    with open("epiktet-frihet.txt", "r", encoding="utf-8") as f: 
        sample = preprocess(f.read())

    counter_dict = Counter(word_tokenize(sample))

    # Print the 25 most frequent (word, count) pairs.
    for word_count in counter_dict.most_common(25): 
        print(word_count)

    visualize(counter_dict)
    
def preprocess(text : str) -> str: 
    """
    Cleans the text: removes HTML tags, replaces punctuation with spaces,
    lowercases, and removes Norwegian stop words.
    
    Args: 
        text (str): the input text.
        
    Returns: 
        cleaned_text (str): the cleaned text.
    """
    # Raw strings avoid invalid-escape warnings for \w and \s.
    text = re.sub(r"<.*?>", "", text)
    text = re.sub(r"[^\w\s]", " ", text, flags=re.UNICODE)
    text = text.lower()
    words = text.split()
    words = [word for word in words if word not in norwegian_stopwords]
    return " ".join(words)


def visualize(counter_dict : Counter) -> None:
    most_common = counter_dict.most_common(15)
    words, freq = zip(*most_common)
    
    plt.bar(words, freq)
    plt.xticks(rotation=45)
    plt.title("Top 15 words")
    plt.savefig("plot.jpg")
    
    wordcloud = WordCloud().generate_from_frequencies(counter_dict)
    wordcloud.to_file("kvakk.jpg")
    

if __name__ == "__main__": 
    main()
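The `Counter` built in the cell above is already a bag of words: it keeps how often each word occurs and discards word order. To compare documents with the vector operations in the following sections, the counts need to be laid out over a shared vocabulary. Here is a minimal sketch, using two made-up sentences and plain `str.split()` instead of `word_tokenize`; the names `doc_a`, `doc_b` and `vocabulary` are only for illustration.

```python
from collections import Counter

# Two made-up documents (not from the course material).
doc_a = "the cat sat on the mat"
doc_b = "the dog sat"

counts_a = Counter(doc_a.split())
counts_b = Counter(doc_b.split())

# A shared, sorted vocabulary gives every word a fixed position in the vector.
vocabulary = sorted(set(counts_a) | set(counts_b))

# A Counter returns 0 for words it has not seen, so missing words become 0.
vector_a = [counts_a[word] for word in vocabulary]
vector_b = [counts_b[word] for word in vocabulary]

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vector_a)    # [1, 0, 1, 1, 1, 2]
print(vector_b)    # [0, 1, 0, 0, 1, 1]
```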

# Dot product

The formula for finding the dot product is the following: 

$a \cdot b =\sum_{i=0}^{n - 1}(a_ib_i)$


In other words, multiply feature i of vector a by feature i of vector b for every index i, from 0 to the last index of the vectors, and sum the products. (We count index 0 as the first index of a vector.) The formula assumes that the two vectors have the same length n. 

Example: 

a = [1, 2, 3]

b = [2, 2, 2]

dot product = ((1 * 2) + (2 * 2) + (3 * 2) ) = 12

Example 2: 

a = [1, 1, 7]

b = [2, 3, 6]

dot product = ((1 * 2) + (1 * 3) + (7 * 6) ) = 47


Example 3: 

a = [0]

b = [1]

dot product = 0 * 1 = 0


Here is an implementation in Python, meant to be readable. 


In [9]:
def dot_product(vector1: list[float], vector2 : list[float]) -> float :
    """
    A method for finding the dot product of two vectors.
    
    Args: 
        vector1 (list[float]): a list representing a vector.
        vector2 (list[float]): a list representing a different vector. 
        
    Returns: 
        sum (float): the sum of the calculation.
        
    Raises: 
        ValueError: If the vectors are not of the same length. 
    """
    
    if len(vector1) != len(vector2):
        raise ValueError("Vectors must be of the same length")
        
    
    total = 0 
    for v1, v2 in zip(vector1, vector2): 
        total += (v1 * v2)
    return total

print(f"Dot product, [1, 2, 3], [2, 2, 2] = {dot_product([1, 2, 3], [2, 2, 2])}" )
print(f"Dot product, [1, 1, 7], [2, 3, 6] = {dot_product([1, 1, 7], [2, 3, 6])}" )
print(f"Dot product, [0], [1] = {dot_product([0], [1])}" )






Dot product, [1, 2, 3], [2, 2, 2] = 12
Dot product, [1, 1, 7], [2, 3, 6] = 47
Dot product, [0], [1] = 0


# Euclidean distance

The Euclidean distance between two vectors **a** and **b** is calculated as:

$$
d(a, b) = \sqrt{\sum_{i=1}^{n}(a_i - b_i)^2}
$$
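As a worked example of the formula, take the vectors a = [1, 2, 3] and b = [2, 2, 2] (the same made-up vectors as in the first dot product example):

$$
d(a, b) = \sqrt{(1 - 2)^2 + (2 - 2)^2 + (3 - 2)^2} = \sqrt{1 + 0 + 1} = \sqrt{2} \approx 1.414
$$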


#TODO add examples and calculations. 

We can do this in a very straightforward way in Python like this: 


In [None]:
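import math


def euclidean_distance(vector1: list[float], vector2: list[float]) -> float:
    if len(vector1) != len(vector2):
        raise ValueError("Vectors must be of the same length")

    # Sum the squared differences feature by feature, then take the square root.
    total = 0
    for v1, v2 in zip(vector1, vector2):
        total += (v1 - v2) ** 2
    return math.sqrt(total)


# Or, more compactly, with a generator expression.
# Note: this version redefines the function above and skips the length check,
# so zip() silently stops at the end of the shorter vector.
def euclidean_distance(vector1, vector2):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(vector1, vector2)))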

#TODO add some example usage
#TODO add docstring
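Here is a small usage sketch for the Euclidean distance. To keep the block self-contained it defines its own helper, `euclidean_distance_sketch`, which is a hypothetical name and not part of the notebook; the input vectors are made up.

```python
import math


def euclidean_distance_sketch(a: list[float], b: list[float]) -> float:
    """Square root of the summed squared differences between a and b."""
    if len(a) != len(b):
        raise ValueError("Vectors must be of the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


d1 = euclidean_distance_sketch([1, 2, 3], [2, 2, 2])  # sqrt(2)
d2 = euclidean_distance_sketch([0, 0], [3, 4])        # the classic 3-4-5 triangle

print(f"{d1:.3f}")  # 1.414
print(d2)           # 5.0
```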

# Length normalization

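Length normalization divides a vector $x$ by its length $||x||$, giving a vector of length 1 that points in the same direction:

$$\frac{x}{||x||}$$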

**Length of a vector**
$$||x|| = \sqrt{x \cdot x} = \sqrt{\sum_{i=1}^nx_i^2}$$




In [1]:
import math

def length_normalization(vector: list[float]) -> list[float] :
    """
    A function for normalizing a vector to length 1.
    
    Args: 
        vector (list[float]): a list representing a vector.
        
    Returns: 
        normalized_vector (list[float]): the input vector divided by its length.
        
    Raises: 
        ValueError: If it is a zero-length vector. 
    """
    
 
    
    total = 0
    for element in vector: 
        total += element ** 2
    length = math.sqrt(total)
    
    if length == 0: 
        raise ValueError("cannot normalize a zero-length vector")
    normalized_vector = [x/ length for x in vector]

    return normalized_vector

#TODO add doc strings and examples. 
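Here is a small worked sketch of length normalization. The helper `normalize_sketch` is a hypothetical, self-contained stand-in for the function above, and the vector [3, 4] is chosen because its length is exactly 5.

```python
import math


def normalize_sketch(vector: list[float]) -> list[float]:
    """Divide every element by the vector's length."""
    length = math.sqrt(sum(x ** 2 for x in vector))
    if length == 0:
        raise ValueError("cannot normalize a zero-length vector")
    return [x / length for x in vector]


unit = normalize_sketch([3, 4])  # length is sqrt(9 + 16) = 5
unit_length = math.sqrt(sum(x ** 2 for x in unit))

print(unit)  # [0.6, 0.8]
print(math.isclose(unit_length, 1.0))  # True
```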

# Cosine similarity

TODO - everything

In [2]:
def cosine_similarity(vector1: list[float], vector2 : list[float]) -> float :
    """
    A method for finding the cosine similarity of two vectors.
    
    Args: 
        vector1 (list[float]): a list representing a vector.
        vector2 (list[float]): a list representing a different vector. 
        
    Returns: 
        dotproduct (float): the dot product of the normalized vectors.
        
    Raises: 
        ValueError: If the vectors are not of the same length. 
    """
    
    if len(vector1) != len(vector2):
        raise ValueError("Vectors must be of the same length")
    if all(v == 0 for v in vector1) or all(v == 0 for v in vector2): 
        return 0.0
        
    
    vector1_normalized = length_normalization(vector1)
    vector2_normalized = length_normalization(vector2)
    
    return dot_product(vector1_normalized, vector2_normalized)

#TODO add examples
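As the docstring above says, the cosine similarity is the dot product of the two length-normalized vectors. This is equivalent to the usual formula:

$$\cos(\theta) = \frac{a \cdot b}{||a|| \, ||b||}$$

Here is a minimal, self-contained sketch of that formula with a few made-up vectors. The helper `cosine_sketch` is a hypothetical name; it divides by the product of the lengths instead of normalizing first, and it mirrors the notebook's convention of returning 0.0 when either vector is all zeros.

```python
import math


def cosine_sketch(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b (0.0 if either is a zero vector)."""
    if len(a) != len(b):
        raise ValueError("Vectors must be of the same length")
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    if length_a == 0 or length_b == 0:
        return 0.0
    return dot / (length_a * length_b)


same_direction = cosine_sketch([1, 2, 3], [2, 4, 6])  # parallel vectors, close to 1
orthogonal = cosine_sketch([1, 0], [0, 1])            # right angle, exactly 0
opposite = cosine_sketch([1, 0], [-1, 0])             # opposite direction, exactly -1

print(round(same_direction, 6), orthogonal, opposite)
```

Vector length does not affect the result: [1, 2, 3] and [2, 4, 6] point in the same direction, so their similarity is 1 (up to floating-point rounding) even though one is twice as long as the other.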




# TF-IDF

# Mini project

# Mini project - using external libraries.

# About, credits and so on. 

## How have I used LLMs?

## Where can you read more about these concepts? 

## How do I draw on IN2110?