# Hello there, welcome to notebook 3!

We are going to be looking at:
* Representing sentences using the bag-of-words model
* Measuring the 'distance' between sentences using the Jaccard distance
* Embedding points using UMAP

It is recommended that you complete all exercises that are not marked as optional.

Feel free to be creative and write your own code wherever you want!

The provided functions are only there to help you if you get stuck :)

## Imports

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
import umap
import numpy as np
import plotly.express as px
import pandas as pd
import warnings
import random
#warnings.filterwarnings('ignore')

## Lesson 1: Sentences

Our first goal is to create a sentence.

One way to do this is to select random words from a vocabulary.

In [6]:
def generate_sentence(vocab, sentence_length=10):
    
    # Create an array of random words from the vocab
    sentence_array = random.choices(vocab, k=sentence_length)
    
    # Join the words together with a space in between
    sentence = ' '.join(sentence_array)
    
    return sentence

numbers = ['one', 'two', 'three', 'four', 'five', 'six']
numbers_sentence = generate_sentence(numbers)
print(numbers_sentence)

five six two six three five six two two five


### Exercise 1: Creating your own sentence

In [7]:
my_vocab = ["test", "hello","apple","banana"]
my_sentence = generate_sentence(my_vocab)
print(my_sentence)

banana apple hello hello test test apple banana hello apple


## Lesson 2: Multiple sentences

Our next goal is to generate multiple sentences.

One way to do this is to select a random vocabulary and then generate a sentence like in Lesson 1.

In [9]:
numbers = ['one', 'two', 'three', 'four']
colours = ['blue', 'yellow', 'red', 'green']
animals = ['cow', 'sheep', 'pig', 'horse']

vocabularies = [numbers, colours, animals]

In [19]:
def generate_sentences_and_labels(vocabularies, n_sentences=20, sentence_length=5):
    sentences = []
    labels = []
    for _ in range(n_sentences):
        # Randomly select the vocab
        vocab_idx = random.randint(0, len(vocabularies)-1)
        vocab = vocabularies[vocab_idx]
        
        # Generate a sentence from that vocab
        sentence = generate_sentence(vocab, sentence_length)
        
        sentences.append(sentence)
        labels.append(str(vocab_idx))
        
    return sentences, labels

sentences, labels = generate_sentences_and_labels(vocabularies)
for i in range(len(sentences)):
    print(f'{labels[i]}: {sentences[i]}')

2: Wales England Wales England England
0: apple banana pear pear pear
2: Wales Wales England Wales England
2: Scotland Wales Wales Scotland England
0: apple pear pear banana apple
1: sheep sheep pig sheep pig
0: pear banana banana apple banana
0: pear banana pear pear apple
0: banana banana banana apple banana
1: pig pig pig pig pig
2: England Wales Scotland Wales Scotland
1: pig cow sheep cow cow
2: Wales England Scotland Scotland Wales
1: sheep pig cow sheep pig
0: apple pear apple pear apple
1: pig cow cow sheep pig
1: pig sheep pig pig pig
2: England Scotland Scotland England England
1: sheep pig pig pig cow
1: cow pig cow pig sheep


### Exercise 2: Creating multiple sentences

In [11]:
vocabularies = [["banana","apple","pear"],["pig", "sheep", "cow"],["England","Scotland","Wales"]]
sentences, labels = generate_sentences_and_labels(vocabularies)
for i in range(len(sentences)):
    print(f'{labels[i]}: {sentences[i]}')
    

2: Wales England Scotland Wales Scotland
0: apple apple banana pear apple
2: England Scotland England England England
1: pig sheep cow cow pig
2: Scotland Scotland Wales Scotland Scotland
2: Scotland England Scotland Scotland England
2: Scotland Scotland Wales Wales England
2: Scotland England Scotland England Scotland
0: pear apple apple apple pear
2: Scotland Scotland England Scotland Scotland
1: sheep sheep pig pig pig
0: apple banana apple apple apple
1: sheep sheep pig sheep cow
1: pig pig sheep sheep pig
2: Scotland Scotland Scotland England England
2: Wales England Scotland Wales England
2: England Scotland Wales England Scotland
1: cow sheep sheep cow cow
0: apple banana banana pear pear
0: pear pear banana apple banana


## Lesson 3: Bag-of-words (BoW)

Our next goal is to represent sentences using the bag-of-words model.

The bag-of-words model is the representation of a sentence as the counts of its words.

For example, the sentence 'one plus one makes two' can be represented as [('one', 2), ('plus', 1), ('makes', 1), ('two', 1)]

It could also be represented as [2, 1, 1, 1] with vocab ['one', 'plus', 'makes', 'two']

In the model the order of the words is completely ignored.

This means that the sentences 'alice likes bob' and 'bob likes alice' have the same representation!
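
To make the two claims above concrete, here is a minimal sketch using only the standard library. The helper name `count_words` is made up for this illustration, and splitting on whitespace is a simplifying assumption (it is not how `CountVectorizer` tokenises).

```python
from collections import Counter

def count_words(sentence):
    # Split on whitespace and count how often each word occurs
    return Counter(sentence.split())

counts = count_words('one plus one makes two')
print(counts['one'], counts['plus'], counts['makes'], counts['two'])

# Word order is discarded, so these two sentences get the same counts
same = count_words('alice likes bob') == count_words('bob likes alice')
print(same)
```

Because a `Counter` only stores how many times each word appears, any reordering of the same words compares equal.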

We are lazy, so we are going to import an existing bag-of-words model called CountVectorizer.

In [16]:
def get_bow_and_vocab(sentences):
    cv = CountVectorizer()
    print(sentences)
    bow = cv.fit_transform(sentences).toarray()
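    # get_feature_names() was removed in newer scikit-learn releases; use get_feature_names_out()
    vocab = cv.get_feature_names_out().tolist()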
    print(bow,vocab)
    return bow, vocab
bow, vocab = get_bow_and_vocab(sentences)
print(list(zip(vocab, bow[0])))
print(sentences[0])

['Wales England Scotland Wales Scotland', 'apple apple banana pear apple', 'England Scotland England England England', 'pig sheep cow cow pig', 'Scotland Scotland Wales Scotland Scotland', 'Scotland England Scotland Scotland England', 'Scotland Scotland Wales Wales England', 'Scotland England Scotland England Scotland', 'pear apple apple apple pear', 'Scotland Scotland England Scotland Scotland', 'sheep sheep pig pig pig', 'apple banana apple apple apple', 'sheep sheep pig sheep cow', 'pig pig sheep sheep pig', 'Scotland Scotland Scotland England England', 'Wales England Scotland Wales England', 'England Scotland Wales England Scotland', 'cow sheep sheep cow cow', 'apple banana banana pear pear', 'pear pear banana apple banana']
[[0 0 0 1 0 0 2 0 2]
 [3 1 0 0 1 0 0 0 0]
 [0 0 0 4 0 0 1 0 0]
 [0 0 2 0 0 2 0 1 0]
 [0 0 0 0 0 0 4 0 1]
 [0 0 0 2 0 0 3 0 0]
 [0 0 0 1 0 0 2 0 2]
 [0 0 0 2 0 0 3 0 0]
 [3 0 0 0 2 0 0 0 0]
 [0 0 0 1 0 0 4 0 0]
 [0 0 0 0 0 3 0 2 0]
 [4 1 0 0 0 0 0 0 0]
 [0 0 1 0

### Exercise 3: Writing a bag-of-words model (Optional)

In [14]:
def bag_of_words_model(sentences):
    freq = {}
    for item in sentences:
        for word in item:
            if word not in freq.keys():
                freq[word] = 1
            else:
                freq[word] += 1
    bow = freq
    vocab = list(freq.keys())
    return bow, vocab

In [15]:

trial_sentences = ['chicken cow', 'cow sheep']
bow, vocab = get_bow_and_vocab(trial_sentences)  # The output from CountVectorizer
your_bow, your_vocab = bag_of_words_model(trial_sentences) # Your output

print(bow, your_bow)
print("\n")
print(vocab, your_vocab)

[[1 1 0]
 [0 1 1]] ['chicken', 'cow', 'sheep']
[[1 1 0]
 [0 1 1]] {'c': 4, 'h': 2, 'i': 1, 'k': 1, 'e': 3, 'n': 1, ' ': 2, 'o': 2, 'w': 2, 's': 1, 'p': 1}


['chicken', 'cow', 'sheep'] ['c', 'h', 'i', 'k', 'e', 'n', ' ', 'o', 'w', 's', 'p']
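
The character counts in the output above appear because looping over a string yields its characters, not its words. One possible word-level version is sketched below. The name `word_level_bow` is hypothetical, and lowercasing plus whitespace splitting is only a rough stand-in for the tokenisation that `CountVectorizer` performs.

```python
def word_level_bow(sentences):
    # Hypothetical helper: split each sentence into lowercase words
    tokenised = [sentence.lower().split() for sentence in sentences]

    # Sorted vocabulary, so the columns come out in alphabetical order
    vocab = sorted({word for words in tokenised for word in words})

    # One row per sentence, one column per vocabulary word
    bow = [[words.count(word) for word in vocab] for words in tokenised]
    return bow, vocab

sketch_bow, sketch_vocab = word_level_bow(['chicken cow', 'cow sheep'])
print(sketch_bow, sketch_vocab)
```

The key change is calling `split()` before counting, and building one count vector per sentence instead of a single dictionary for all sentences.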


## Lesson 4: Jaccard distance

Our next goal is to measure the distances between sentences which have been represented using the bag-of-words model.

The first step is to convert our representations to sets.
Given a representation $R$ we define a set $S$ as

$S[i] = \begin{cases}
  1, & \text{if } R[i] > 0, \\
  0, & \text{otherwise}.
\end{cases}$
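
As a small illustration of this definition, the sketch below turns a count vector into its 0/1 indicator vector. The function name `to_indicator` is an assumption made for this example.

```python
def to_indicator(counts):
    # 1 where the word occurs at least once, 0 otherwise
    return [1 if c > 0 else 0 for c in counts]

r = [2, 1, 0, 3]
s = to_indicator(r)
print(s)
```

Note that the actual counts are thrown away at this step: a word that appears once and a word that appears three times both map to 1.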

Now we can use the Jaccard distance,
which given two sets $A$ and $B$ is defined as

$J(A, B) = 1 - \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}$

In words, this is one minus the size of the intersection of $A$ and $B$ divided by the size of their union. If you've forgotten what these terms mean, here's a quick reminder:

<img src=images/set.png width="400">
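
The formula can also be tried directly on Python sets, where `&` is intersection and `|` is union. This is only a sketch: `jaccard_distance_sets` is a made-up name, and it assumes at least one of the two sets is non-empty.

```python
def jaccard_distance_sets(a, b):
    # Assumes at least one of the sets is non-empty
    return 1 - len(a & b) / len(a | b)

a = set('chicken cow'.split())
b = set('cow sheep'.split())
d = jaccard_distance_sets(a, b)
print(f'{d:.2f}')
```

Here the intersection is {'cow'} (size 1) and the union is {'chicken', 'cow', 'sheep'} (size 3), so the distance is $1 - \frac{1}{3}$.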

In [20]:
def calculate_jaccard_dist(u, v):
    intersection = 0
    union = 0
    for i in range(len(u)):
        
        # The word is present in both vectors
        if u[i] > 0 and v[i] > 0:
            intersection += 1
            
        # The word is present in one of the vectors
        if u[i] > 0 or v[i] > 0:
            union += 1
            
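    # Two all-zero vectors contain no words; treat them as identical to avoid dividing by zero
    if union == 0:
        return 0.0
    jaccard_dist = 1 - (intersection / union)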
    return jaccard_dist

u = [1, 1, 0]
v = [0, 1, 1]
jaccard_dist = calculate_jaccard_dist(u, v)
print(f'The Jaccard distance between {u} and {v} is {jaccard_dist:.2f}')

The Jaccard distance between [1, 1, 0] and [0, 1, 1] is 0.67


### Exercise 4: Understanding the Jaccard distance

In [22]:
# Q4.1.1 - What is the Jaccard distance between the sentences ['hi'] and ['hi', 'hi']

bow, vocab = get_bow_and_vocab(["hi","hi hi"])
print(bow)
jaccard_dist = calculate_jaccard_dist(bow[0], bow[1])
print(f'The Jaccard distance between {bow[0]} and {bow[1]} is {jaccard_dist:.2f}')
# Q4.1.2 - Does the answer to 4.1.1 surprise you?

[[1]
 [2]] ['hi']
[[1]
 [2]]
The Jaccard distance between [1] and [2] is 0.00


In [24]:
# Q4.2 - Can you find two sentences that have a Jaccard distance of 0.5?
bow, vocab = get_bow_and_vocab(["hello there","hello there General Kenobi"])
print(bow)
jaccard_dist = calculate_jaccard_dist(bow[0], bow[1])
print(f'The Jaccard distance between {bow[0]} and {bow[1]} is {jaccard_dist:.2f}')

[[0 1 0 1]
 [1 1 1 1]] ['general', 'hello', 'kenobi', 'there']
[[0 1 0 1]
 [1 1 1 1]]
The Jaccard distance between [0 1 0 1] and [1 1 1 1] is 0.50


In [26]:
# Q4.3.1 - What is the smallest value that the Jaccard distance can take?
#0
# Q4.3.2 - Can you find two sentences that have this distance?

In [28]:
# Q4.4.1 - What is the largest value that the Jaccard distance can take?
#1
# Q4.4.2 - Can you find two sentences that have this distance?
bow, vocab = get_bow_and_vocab(["hello there","another sentence"])
print(bow)
jaccard_dist = calculate_jaccard_dist(bow[0], bow[1])
print(f'The Jaccard distance between {bow[0]} and {bow[1]} is {jaccard_dist:.2f}')

[[0 1 0 1]
 [1 0 1 0]] ['another', 'hello', 'sentence', 'there']
[[0 1 0 1]
 [1 0 1 0]]
The Jaccard distance between [0 1 0 1] and [1 0 1 0] is 1.00


## Lesson 5: UMAP

UMAP is a dimension-reduction technique that can be used for visualization.

Given a set of vectors, it can embed them such that similar vectors lie close together.

It can accept multiple metrics but we are going to use the Jaccard distance.
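
To get a feel for what "similar" means under this metric, the sketch below computes the pairwise Jaccard distances for three short sentences in plain Python. It does not call UMAP; the `jaccard` helper and the example sentences are assumptions made for illustration.

```python
def jaccard(a, b):
    # Jaccard distance between two sets of words (assumes a | b is non-empty)
    return 1 - len(a & b) / len(a | b)

example_sentences = ['cow pig cow', 'pig cow sheep', 'red blue red']
word_sets = [set(s.split()) for s in example_sentences]

# Pairwise distance matrix: entry [i][j] is the distance between sentences i and j
dist = [[jaccard(a, b) for b in word_sets] for a in word_sets]
for row in dist:
    print([round(d, 2) for d in row])
```

The two animal sentences share most of their words and so are close together, while the colour sentence shares no words with either and sits at the maximum distance of 1. An embedding driven by these distances would be expected to place the first two points near each other and the third far away.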

Note: If you get an error when running `plot_sentences_and_labels` then you probably need to give it more sentences.

In [22]:
def plot_sentences_and_labels(sentences, labels=None, metric='jaccard'):
    bow, _ = get_bow_and_vocab(sentences) 
    
    # Embed the vectors into 2d using UMAP with the given metric
    reducer = umap.UMAP(metric=metric)
    embedding = reducer.fit_transform(bow)
    
    # Create a DataFrame (it is required for px.scatter)
    df = pd.DataFrame()
    df['x'] = embedding[:, 0]
    df['y'] = embedding[:, 1]
    df['sentence'] = sentences
    df['label'] = labels if labels else None
    
    # Create an interactive scatter plot, with colours given by the labels
    color = 'label' if labels else None
    fig = px.scatter(df, 
                     x='x',
                     y='y', 
                     color=color,
                     hover_data=['sentence'])
    fig.show()

numbers = ['one', 'two', 'three', 'four']
colours = ['blue', 'yellow', 'red', 'green']
animals = ['cow', 'sheep', 'pig', 'horse']
vocabularies = [numbers, colours, animals]
sentences, labels = generate_sentences_and_labels(vocabularies)
print(sentences)
plot_sentences_and_labels(sentences, labels)

['pig pig horse cow sheep', 'yellow green red red green', 'pig horse sheep pig cow', 'four three four four two', 'green red red red red', 'cow pig sheep cow sheep', 'four four four four three', 'one one four two one', 'four four four four three', 'horse pig cow horse sheep', 'red blue yellow yellow yellow', 'three one two three one', 'blue red red red green', 'blue yellow yellow green red', 'green red blue green yellow', 'sheep horse horse pig cow', 'green yellow yellow green yellow', 'two four four three three', 'green green blue green red', 'cow pig sheep sheep cow']
['pig pig horse cow sheep', 'yellow green red red green', 'pig horse sheep pig cow', 'four three four four two', 'green red red red red', 'cow pig sheep cow sheep', 'four four four four three', 'one one four two one', 'four four four four three', 'horse pig cow horse sheep', 'red blue yellow yellow yellow', 'three one two three one', 'blue red red red green', 'blue yellow yellow green red', 'green red blue green yellow',

### Exercise 5: Experimenting with UMAP (Optional)

In [32]:
# Q5.1.1 - Generate sentences with overlapping vocabularies 

In [34]:
numbers = ['one', 'two', 'three', 'four']
colours = ['blue', 'yellow', 'red', 'green']
animals = ['cow', 'sheep', 'pig', 'horse']
vocabularies = [numbers, colours, animals]
sentences, labels = generate_sentences_and_labels(vocabularies)
plot_sentences_and_labels(sentences, labels)

In [46]:
# Q5.1.2 - Plot these sentences
#        - What do you see? 
#        - How are the clusters arranged in the space?