# Hello there, welcome to your first notebook!

We are going to be looking at:
* Representing sentences using the bag-of-words model
* Measuring the 'distance' between sentences using the Jaccard distance
* Embedding points using UMAP

It is recommended that you complete all exercises that are not marked as optional.

Feel free to be creative and write your own code wherever you want!

The provided functions are only there to help you if you get stuck :)

## Imports

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
import umap
import numpy as np
import plotly.express as px
import pandas as pd
import warnings
import random
warnings.filterwarnings('ignore')

## Lesson 1: Sentences

Our first goal is to create a sentence.

One way to do this is to select random words from a vocabulary.

In [3]:
def generate_sentence(vocab, sentence_length=10):
    
    # Create an array of random words from the vocab
    sentence_array = random.choices(vocab, k=sentence_length)
    
    # Join the words together with a space in between
    sentence = ' '.join(sentence_array)
    
    return sentence

numbers = ['one', 'two', 'three', 'four', 'five', 'six']
numbers_sentence = generate_sentence(numbers)
print(numbers_sentence)

four six four four five one one four six three


### Exercise 1: Creating your own sentence

In [None]:
my_vocab = # TODO: Define the vocab
my_sentence = # TODO: Create a sentence
# TODO: Print the output

## Lesson 2: Multiple sentences

Our next goal is to generate multiple sentences.

One way to do this is to select a random vocabulary and then generate a sentence like in Lesson 1.

In [5]:
numbers = ['one', 'two', 'three', 'four']
colours = ['blue', 'yellow', 'red', 'green']
animals = ['cow', 'sheep', 'pig', 'horse']

vocabularies = [numbers, colours, animals]

In [6]:
def generate_sentences_and_labels(vocabularies, n_sentences=20, sentence_length=5):
    sentences = []
    labels = []
    for _ in range(n_sentences):
        # Randomly select the vocab
        vocab_idx = random.randint(0, len(vocabularies)-1)
        vocab = vocabularies[vocab_idx]
        
        # Generate a sentence from that vocab
        sentence = generate_sentence(vocab, sentence_length)
        
        sentences.append(sentence)
        labels.append(str(vocab_idx))
        
    return sentences, labels

sentences, labels = generate_sentences_and_labels(vocabularies)
for i in range(len(sentences)):
    print(f'{labels[i]}: {sentences[i]}')

2: sheep pig sheep cow pig
2: pig horse sheep pig sheep
1: yellow red red yellow red
1: yellow blue yellow yellow blue
0: one three one three four
2: pig cow sheep horse horse
1: green green red yellow yellow
2: cow horse horse horse pig
1: red yellow blue red red
2: cow sheep pig sheep cow
1: yellow yellow red red green
0: three four three three three
0: three one four one three
1: yellow green blue yellow yellow
1: yellow yellow blue red green
2: cow pig cow horse cow
1: blue red yellow blue red
0: three four one four two
2: cow sheep horse cow horse
1: blue yellow green red yellow


### Exercise 2: Creating multiple sentences

In [None]:
vocabularies = # TODO: Define your own vocabularies
sentences, labels = # TODO: Create sentences and labels
# TODO: Print the output

## Lesson 3: Bag-of-words (BoW)

Our next goal is to represent sentences using the bag-of-words model.

The bag-of-words model is the representation of a sentence as the counts of its words.

For example, the sentence 'one plus one makes two' can be represented as [('one', 2), ('plus', 1), ('makes', 1), ('two', 1)]

It could also be represented as [2, 1, 1, 1] with vocab ['one', 'plus', 'makes', 'two']

In the model the order of the words is completely ignored.

This means that the sentences 'alice likes bob' and 'bob likes alice' have the same representation!
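
To make the two representations above concrete, here is a minimal sketch using only the standard library. The helper names `count_words` and `to_vector` and the variable `example_vocab` are made up for illustration and are not used elsewhere in this notebook.

```python
from collections import Counter

def count_words(sentence):
    # Split on whitespace and count how often each word occurs
    return Counter(sentence.split())

def to_vector(sentence, vocab):
    # Look up the count of each vocab word (0 if the word is absent)
    counts = count_words(sentence)
    return [counts[word] for word in vocab]

example_vocab = ['one', 'plus', 'makes', 'two']
print(to_vector('one plus one makes two', example_vocab))  # [2, 1, 1, 1]

# Word order is ignored, so these two sentences get identical counts
print(count_words('alice likes bob') == count_words('bob likes alice'))  # True
```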

We are lazy, so we are going to import an existing bag-of-words model called CountVectorizer.

In [7]:
def get_bow_and_vocab(sentences):
    cv = CountVectorizer()
    bow = cv.fit_transform(sentences).toarray()
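    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() (available from 1.0) returns an array, so convert it to a list
    vocab = cv.get_feature_names_out().tolist()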
    
    return bow, vocab

bow, vocab = get_bow_and_vocab(sentences)
print(list(zip(vocab, bow[0])))
print(sentences[0])

[('blue', 0), ('cow', 1), ('four', 0), ('green', 0), ('horse', 0), ('one', 0), ('pig', 2), ('red', 0), ('sheep', 2), ('three', 0), ('two', 0), ('yellow', 0)]
sheep pig sheep cow pig


### Exercise 3: Writing a bag-of-words model (Optional)

In [6]:
def bag_of_words_model(sentences):
    # TODO: Write your own bag-of-words model
    bow = None
    vocab = None
    return bow, vocab

In [7]:
# TODO: Check that your model is working

trial_sentences = ['chicken cow', 'cow sheep']
bow, vocab = get_bow_and_vocab(trial_sentences)  # The output from CountVectorizer
your_bow, your_vocab = bag_of_words_model(trial_sentences) # Your output

print(bow, your_bow)
print(vocab, your_vocab)


## Lesson 4: Jaccard distance

Our next goal is to measure the distances between sentences which have been represented using the bag-of-words model.

The first step is to convert our representations to sets.
Given a representation $R$ we define a set $S$ as

$S[i] = \begin{cases}
  1, & \text{if } R[i] > 0, \\
  0, & \text{otherwise}.
\end{cases}$
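
As a small sketch of this conversion, the hypothetical helper `to_presence_vector` below (the name is an assumption, not part of the notebook) maps every positive count to 1 and everything else to 0.

```python
def to_presence_vector(representation):
    # 1 if the word occurs at least once, 0 otherwise
    return [1 if count > 0 else 0 for count in representation]

example_counts = [0, 1, 0, 2, 2]
print(to_presence_vector(example_counts))  # [0, 1, 0, 1, 1]
```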

Now we can use the Jaccard distance,
which given two sets $A$ and $B$ is defined as

$J(A, B) = 1 - \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}$

In words, this is one minus the size of the intersection of $A$ and $B$ divided by the size of the union of $A$ and $B$. If you've forgotten what these terms mean, here's a quick reminder:

<img src=https://i.stack.imgur.com/uH6cL.png width="400">
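
The same formula can be sketched directly with Python's built-in sets, where `&` is the intersection and `|` is the union. The helper name `jaccard_dist_from_words` is made up for illustration, and the sketch assumes that at least one of the two sentences contains a word (otherwise the union is empty and the division fails).

```python
def jaccard_dist_from_words(sentence_a, sentence_b):
    # Treat each sentence as the set of its distinct words
    a = set(sentence_a.split())
    b = set(sentence_b.split())
    # Assumes the union is non-empty, i.e. at least one sentence has a word
    return 1 - len(a & b) / len(a | b)

# The sets share 'blue' (intersection size 1) out of 3 distinct words (union size 3)
example_dist = jaccard_dist_from_words('red blue', 'blue green')
print(f'{example_dist:.2f}')  # 0.67
```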

In [8]:
def calculate_jaccard_dist(u, v):
    intersection = 0
    union = 0
    for i in range(len(u)):
        
        # The word is present in both vectors
        if u[i] > 0 and v[i] > 0:
            intersection += 1
            
        # The word is present in at least one of the vectors
        if u[i] > 0 or v[i] > 0:
            union += 1
            
    jaccard_dist = 1 - (intersection / union)
    return jaccard_dist

u = [1, 1, 0]
v = [0, 1, 1]
jaccard_dist = calculate_jaccard_dist(u, v)
print(f'The Jaccard distance between {u} and {v} is {jaccard_dist:.2f}')

The Jaccard distance between [1, 1, 0] and [0, 1, 1] is 0.67
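
To check this result by hand, write each vector as the set of positions where it is non-zero, so $A = \{0, 1\}$ for `u` and $B = \{1, 2\}$ for `v`. Then

$\lvert A \cap B \rvert = \lvert \{1\} \rvert = 1, \qquad \lvert A \cup B \rvert = \lvert \{0, 1, 2\} \rvert = 3$

$J(A, B) = 1 - \frac{1}{3} = \frac{2}{3} \approx 0.67$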


### Exercise 4: Understanding the Jaccard distance

In [None]:
# Q4.1.1 - What is the Jaccard distance between the sentences ['hi'] and ['hi', 'hi']?
# Q4.1.2 - Does the answer to 4.1.1 surprise you?

In [None]:
# Q4.2 - Can you find two sentences that have a Jaccard distance of 0.5?

In [None]:
# Q4.3.1 - What is the smallest value that the Jaccard distance can take?
# Q4.3.2 - Can you find two sentences that have this distance?

In [None]:
# Q4.4.1 - What is the largest value that the Jaccard distance can take?
# Q4.4.2 - Can you find two sentences that have this distance?

## Lesson 5: UMAP

UMAP is a dimension-reduction technique that can be used for visualization.

Given a set of vectors, it can embed them such that similar vectors lie close together.

It can accept multiple metrics but we are going to use the Jaccard distance.
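
To get a feel for what "similar" means under this metric, the sketch below computes all pairwise Jaccard distances for three toy count vectors. It does not call UMAP. It only illustrates the kind of distances the embedding is built from, and the vectors and the names `jaccard_dist`, `toy_vectors` and `distance_matrix` are made up for illustration.

```python
def jaccard_dist(u, v):
    # Same calculation as calculate_jaccard_dist, repeated so this sketch stands alone
    intersection = sum(1 for a, b in zip(u, v) if a > 0 and b > 0)
    union = sum(1 for a, b in zip(u, v) if a > 0 or b > 0)
    return 1 - intersection / union

# Toy count vectors over the vocab ['cow', 'pig', 'red', 'blue']
toy_vectors = [
    [2, 1, 0, 0],  # 'cow cow pig'
    [1, 2, 0, 0],  # 'pig cow pig'
    [0, 0, 1, 2],  # 'blue red blue'
]

# Entry [i][j] is the Jaccard distance between vector i and vector j
distance_matrix = [[jaccard_dist(u, v) for v in toy_vectors] for u in toy_vectors]
for row in distance_matrix:
    print([f'{d:.2f}' for d in row])
```

The two animal sentences use exactly the same words, so their distance is 0, while each of them shares no words with the colour sentence, giving a distance of 1.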

Note: If you get an error when running `plot_sentences_and_labels` then you probably need to give it more sentences.

In [10]:
def plot_sentences_and_labels(sentences, labels=None, metric='jaccard'):
    bow, _ = get_bow_and_vocab(sentences) 
    
    # Embed the vectors into 2d using UMAP with the given metric
    reducer = umap.UMAP(metric=metric)
    embedding = reducer.fit_transform(bow)
    
    # Create a DataFrame (it is required for px.scatter)
    df = pd.DataFrame()
    df['x'] = embedding[:, 0]
    df['y'] = embedding[:, 1]
    df['sentence'] = sentences
    df['label'] = labels if labels else None
    
    # Create an interactive scatter plot, with colours given by the labels
    color = 'label' if labels else None
    fig = px.scatter(df, 
                     x='x',
                     y='y', 
                     color=color,
                     hover_data=['sentence'])
    fig.show()

numbers = ['one', 'two', 'three', 'four']
colours = ['blue', 'yellow', 'red', 'green']
animals = ['cow', 'sheep', 'pig', 'horse']
vocabularies = [numbers, colours, animals]
sentences, labels = generate_sentences_and_labels(vocabularies)
plot_sentences_and_labels(sentences, labels)

### Exercise 5: Experimenting with UMAP (Optional)

In [None]:
# Q5.1.1 - Generate sentences with overlapping vocabularies 

In [None]:
# Q5.1.2 - Plot these sentences
#        - What do you see? 
#        - How are the clusters arranged in the space?