# Project: Sentence similarity

## Team members:

* Khanh Duong Tran.
* Brandon.

## Goal: Compare the similarity of two sentences by calculating cosine similarity
### Main ideas:

#### 1. Vectorize the sentences.
#### 2. Calculate Cosine similarity.
#### 3. Compare the similarity.

## Vectorize sentences

### Steps include:
#### 1. Process the sentences.
#### 2. Tokenize the sentences.
#### 3. Convert the tokens into an array.

### Process the sentences:
#### Steps include:

##### 1. Remove non-word characters such as punctuation, numbers, emojis, etc.
##### 2. Remove redundant spaces.
##### 3. Convert everything to lower case.
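
The three processing steps above could be sketched as follows. This is only an illustration: the name `clean_sentence` is hypothetical, and it assumes English text where only the ASCII letters a-z count as word characters, so digits, punctuation, and emojis are all dropped.

```python
import re

def clean_sentence(sentence):
    # Step 3 first: lower-case so the pattern below only needs a-z.
    sentence = sentence.lower()
    # Step 1: replace anything that is not a-z or whitespace with a space
    # (assumption: ASCII English letters are the only word characters).
    sentence = re.sub(r"[^a-z\s]", " ", sentence)
    # Step 2: split on whitespace and re-join to collapse redundant spaces.
    return " ".join(sentence.split())

print(clean_sentence("Hello,   World!! 123 😀"))  # hello world
```

Replacing with a space rather than deleting keeps words such as "first,second" from merging into one token.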

## Calculate Cosine similarity

### The formula is:

$$
\cos(\theta) = \frac{A \cdot B}{\|A\| \times \|B\|}
$$
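
A minimal sketch of the formula in plain Python, assuming both vectors are lists of numbers with the same length. The function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for this example, not something the project specifies.

```python
import math

def cosine_similarity(a, b):
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Numerator: dot product A . B
    dot = sum(x * y for x, y in zip(a, b))
    # Denominator: product of the Euclidean norms ||A|| and ||B||
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: treat a zero vector as having no similarity
    # instead of dividing by zero.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [0, 1]))  # 0.0
```

For word-count vectors every component is non-negative, so the result falls between 0 (no shared words) and 1 (same direction).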

## Compare

### We define the similarity threshold as 0.5.
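
One way the threshold check might look, as a sketch. The helper name `is_similar` is hypothetical, and treating a score exactly equal to 0.5 as "similar" is an assumption, since the text does not say whether the boundary is inclusive.

```python
import math

THRESHOLD = 0.5  # value taken from the text above

def is_similar(a, b, threshold=THRESHOLD):
    # Cosine similarity of two equal-length count vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return False  # assumption: an empty sentence matches nothing
    # Assumption: the comparison is inclusive (>=).
    return dot / (norm_a * norm_b) >= threshold

print(is_similar([1, 1, 0], [1, 1, 1]))  # True
print(is_similar([1, 0, 0], [0, 1, 1]))  # False
```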

In [1]:
import string

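def preprocess_sentence(sentence):
    # Lower-case the text so that "This" and "this" count as the same word.
    sentence = sentence.lower()
    # Delete every ASCII punctuation character listed in string.punctuation.
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))
    # split() with no arguments also collapses runs of whitespace.
    words = sentence.split()
    return words

def create_vocabulary(sentences):
    # Collect the unique words across all sentences.
    vocabulary = set()
    for sentence in sentences:
        words = preprocess_sentence(sentence)
        vocabulary.update(words)
    # Map each word to a vector index. Set iteration order is arbitrary,
    # so the index assigned to a word can differ between runs.
    return {word: index for index, word in enumerate(vocabulary)}

def vectorize_sentence(sentence, vocabulary):
    # Start from a zero vector with one slot per vocabulary word.
    vector = [0] * len(vocabulary)
    words = preprocess_sentence(sentence)
    for word in words:
        if word in vocabulary:
            # Count each occurrence of a known word.
            vector[vocabulary[word]] += 1
    return vector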

sentences = ["This is the first sentence.", "This is the second sentence."]
vocabulary = create_vocabulary(sentences)

vector1 = vectorize_sentence(sentences[0], vocabulary)
vector2 = vectorize_sentence(sentences[1], vocabulary)

print(vector1)
print(vector2)

[1, 1, 1, 1, 0, 1]
[0, 1, 1, 1, 1, 1]


In [2]:
import string

def preprocess_sentence(sentence):
    sentence = sentence.lower()
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))
    words = sentence.split()
    return words

def create_vocabulary(sentences):
    vocabulary = set()
    for sentence in sentences:
        words = preprocess_sentence(sentence)
        vocabulary.update(words)
    return {word: index for index, word in enumerate(vocabulary)}

def vectorize_sentence(sentence, vocabulary):
    # Same logic as in the previous cell, with a trace of every update printed.
    vector = [0] * len(vocabulary)
    words = preprocess_sentence(sentence)
    for word in words:
        if word in vocabulary:
            vector[vocabulary[word]] += 1
            print(f"Incremented element {vocabulary[word]} due to word '{word}'.")
        else:
            print(f"Ignored word '{word}' as it is not in the vocabulary.")
    return vector

sentences = ["This is the first sentence.", "This is the second sentence."]
vocabulary = create_vocabulary(sentences)

vector1 = vectorize_sentence(sentences[0], vocabulary)
vector2 = vectorize_sentence(sentences[1], vocabulary)

print(vector1)
print(vector2)


Incremented element 2 due to word 'this'.
Incremented element 1 due to word 'is'.
Incremented element 5 due to word 'the'.
Incremented element 0 due to word 'first'.
Incremented element 3 due to word 'sentence'.
Incremented element 2 due to word 'this'.
Incremented element 1 due to word 'is'.
Incremented element 5 due to word 'the'.
Incremented element 4 due to word 'second'.
Incremented element 3 due to word 'sentence'.
[1, 1, 1, 1, 0, 1]
[0, 1, 1, 1, 1, 1]
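
As a worked check, the two printed vectors can be carried through the cosine formula by hand, taking $A$ = `vector1` and $B$ = `vector2`. The vectors agree in four positions, and each contains five ones:

$$
A \cdot B = 4, \quad \|A\| = \|B\| = \sqrt{5}, \quad \cos(\theta) = \frac{4}{\sqrt{5} \times \sqrt{5}} = 0.8
$$

Since 0.8 is above the 0.5 threshold, the two example sentences would count as similar under that rule.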
