# Distributional Semantic Models in Python

## EMLAR 2025

### Authors
* Raquel G. Alhama, Tilburg University
* Andrew Jessop, University of Liverpool
* Meaghan Fowlie, Utrecht University

##  Jupyter Notebook commands

* Run the code in a cell (several options):
    * Push the play button (possibly labelled "Run")
    * CTRL-ENTER
    * SHIFT-ENTER
    * Cell > Run Cells
* Make a new cell: + button
* Written code will autosave
* Re-start Python (but keep all your code): Kernel > Restart & Clear Output
* Autocomplete: TAB > select option

## A couple of Python notes

* Comments are preceded by a #. Python will ignore them, so you can use them to write notes to yourself/your reader but also to temporarily remove code without deleting it.
* Comment/Uncomment several lines at once: highlight > CTRL-/

In [1]:
# First we import some libraries that will be useful
# You may need to uncomment the following lines the first time you run the code
# !conda install --yes  numpy nltk 
# !conda install --yes  -c conda-forge scikit-learn 

import re
import nltk
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity 

## Data Preprocessing


You can try this code with any textual data. For this example, we will use some [children's books from Project Gutenberg](https://www.gutenberg.org/ebooks/bookshelves/search/?query=children|child|school).

Note: if you get the error below, Jupyter Notebook is just complaining that you asked it to print too much. To fix it, print less: here, `[4000:]` means from character number 4000 to the end, so increase the number to start later in the text.

```
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
```




In [2]:
# Load data from a book that we previously downloaded from Project Gutenberg
# (the "with" block closes the file for us once it has been read)
with open("data/pg6328.txt", 'r') as f:
    raw = f.read()

# The variable "raw" now contains all the text from this file. 

# Let's have a look at some of the data:
# print(raw[4000:])

In [3]:
# At the moment, raw contains one long string of characters, but we are interested in words rather than characters.

# How can we separate this text into sentences, and the sentences into words?

# This process is called "tokenization". 

# First we need to download a tokenizer from NLTK (uncomment and run just once).
# Newer NLTK versions may ask for 'punkt_tab' instead; if so, follow the error message.
# nltk.download('punkt')



[['CONTENTS', 'PREFACE', 'ARABIAN', 'NIGHTS', 'Ali', 'Baba', 'and', 'the', 'Forty', 'Thieves', 'The', 'Story', 'of', 'Aladdin', ';', 'or', 'the', 'Wonderful', 'Lamp', 'Sindbad', 'the', 'Sailor', 'ROBINSON', 'CRUSOE', 'Robinson', 'Crusoe', 'is', 'Shipwrecked', '_Daniel', 'Defoe_', 'Alone', 'on', 'a', 'Desolate', 'Island', '_Daniel', 'Defoe_', 'The', 'Building', 'of', 'the', 'Boat', '_Daniel', 'Defoe_', 'Finds', 'the', 'Print', 'of', 'a', 'Man', "'", 's', 'Foot', 'on', 'the', 'Sand', '_Daniel', 'Defoe_', 'Friday', 'Rescued', 'from', 'the', 'Cannibals', '_Daniel', 'Defoe_', 'Robinson', 'Crusoe', 'Rescued', '_Daniel', 'Defoe_', 'GULLIVER', "'", 'S', 'TRAVELS', 'Gulliver', 'is', 'Shipwrecked', 'and', 'Swims', 'for', 'His', 'Life', '_Jonathan', 'Swift_', 'Gulliver', 'at', 'the', 'Court', 'of', 'Lilliput', '_Jonathan', 'Swift_', 'Gulliver', 'Captures', 'Fifty', 'of', 'the', 'Enemy', "'", 's', 'Ships', '_Jonathan', 'Swift_', 'Gulliver', 'Leaves', 'Lilliput', '_Jonathan', 'Swift_', 'Gulliver', 

In [24]:
# Now let's split the text into sentences.

# In some data, each sentence has its own line, so you can split it with the split method as follows:
lines = raw.split("\n")

# This isn't an appropriate split for this data, as we can see if we print out some:
print("A few lines:")
for i in range(1000, 1010):
    print("line", i, lines[i])



A few lines:
line 1000 
line 1001 While Ali Baba took these measures, the captain of the forty
line 1002 robbers returned to the forest with inconceivable mortification.
line 1003 He did not stay long: the loneliness of the gloomy cavern became
line 1004 frightful to him. He determined, however, to avenge the fate of
line 1005 his companions, and to accomplish the death of Ali Baba. For this
line 1006 purpose he returned to the town and took a lodging in a khan, and
line 1007 disguised himself as a merchant in silks. Under this assumed
line 1008 character, he gradually conveyed a great many sorts of rich stuffs
line 1009 and fine linen to his lodging from the cavern, but with all the


In [23]:
# Instead, nltk.sent_tokenize splits the text into actual sentences
sentences = nltk.sent_tokenize(raw)

print("A few sentences:")
for i in range(1000, 1010):
    print("\nsentence", i, sentences[i])

# And we tokenize all the words in each sentence and collect them together 
tokenized = []
header = True
for sentence in sentences:
    if header and sentence.startswith("CONTENTS"):
        header = False
    if not header: # We ignore everything before the table of contents
        tokenized.append(nltk.wordpunct_tokenize(sentence))
    

# Let's look at a few of the tokenized sentences. Do you spot any problems? 
print("A few tokenized sentences:")
for i in range(1000, 1010):
    print("\nsentence", i, tokenized[i])

A few sentences:

sentence 1000 What has he done to obtain from Thee a lot so agreeable?

sentence 1001 And what
have I done to deserve one so wretched?"

sentence 1002 While the porter was thus indulging his melancholy, a servant came
out of the house, and taking him by the arm, bade him follow him,
for Sindbad, his master, wanted to speak to him.

sentence 1003 The servants
brought him into a great hall, where a number of people sat round
a table, covered with all sorts of savory dishes.

sentence 1004 At the upper end
sat a comely, venerable gentleman, with a long white beard, and
behind him stood a number of officers and domestics, all ready to
attend his pleasure.

sentence 1005 This person was Sindbad.

sentence 1006 Hindbad, whose fear
was increased at the sight of so many people, and of a banquet so
sumptuous, saluted the company trembling.

sentence 1007 Sindbad bade him draw
near, and seating him at his right hand, served him himself, and
gave him excellent wine, of which the

In [4]:
# Let's lowercase the text:
lowercased = []
for sentence in tokenized:
    lowercased.append( [s.lower() for s in sentence] )
        
print(lowercased[:100])

[['contents', 'preface', 'arabian', 'nights', 'ali', 'baba', 'and', 'the', 'forty', 'thieves', 'the', 'story', 'of', 'aladdin', ';', 'or', 'the', 'wonderful', 'lamp', 'sindbad', 'the', 'sailor', 'robinson', 'crusoe', 'robinson', 'crusoe', 'is', 'shipwrecked', '_daniel', 'defoe_', 'alone', 'on', 'a', 'desolate', 'island', '_daniel', 'defoe_', 'the', 'building', 'of', 'the', 'boat', '_daniel', 'defoe_', 'finds', 'the', 'print', 'of', 'a', 'man', "'", 's', 'foot', 'on', 'the', 'sand', '_daniel', 'defoe_', 'friday', 'rescued', 'from', 'the', 'cannibals', '_daniel', 'defoe_', 'robinson', 'crusoe', 'rescued', '_daniel', 'defoe_', 'gulliver', "'", 's', 'travels', 'gulliver', 'is', 'shipwrecked', 'and', 'swims', 'for', 'his', 'life', '_jonathan', 'swift_', 'gulliver', 'at', 'the', 'court', 'of', 'lilliput', '_jonathan', 'swift_', 'gulliver', 'captures', 'fifty', 'of', 'the', 'enemy', "'", 's', 'ships', '_jonathan', 'swift_', 'gulliver', 'leaves', 'lilliput', '_jonathan', 'swift_', 'gulliver', 

In this part we are going to use some *regular expressions* via the [re](https://docs.python.org/3/library/re.html) package. Regular expressions (often shortened to *regex*) are a powerful tool for finding patterns in text. They are useful when preparing data for modelling, for example to remove unwanted characters (like punctuation) or to search text.
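
As a quick illustration of what regular expressions can do, here is a minimal sketch on a made-up string (the example text and variable names are ours, not part of the data). It shows three functions from the `re` module: `re.sub` to replace matches, `re.findall` to collect them, and `re.search` to find the first one.

```python
import re

# Made-up example text (not taken from the book)
text = "gulliver's travels; or, the wonderful lamp"

# re.sub replaces every match of the pattern; [^a-z ] matches any character
# that is neither a lowercase letter nor a space
no_punct = re.sub(r"[^a-z ]", "", text)

# re.findall returns all non-overlapping matches; [a-z]+ is a run of letters
words = re.findall(r"[a-z]+", text)

# re.search finds the first match anywhere in the string (or returns None);
# w[a-z]* is a "w" followed by any number of letters
match = re.search(r"w[a-z]*", text)

print(no_punct)
print(words)
print(match.group())
```

Note that `re.sub` glues "gulliver" and "s" together, while `re.findall` keeps them as separate items; which behaviour you want depends on how you plan to tokenize.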

In [5]:
# Let's remove all the punctuation
wordsonly = []
for sentence in lowercased:
    words = []
    for s in sentence:
        # replace anything that's not a-z with the empty string
        word = re.sub(r'[^a-z]', '', s)
        if word != '': # As long as that didn't delete the whole word, add it to the list
            words.append(word)
    wordsonly.append(words)
print(wordsonly[:100])

[['contents', 'preface', 'arabian', 'nights', 'ali', 'baba', 'and', 'the', 'forty', 'thieves', 'the', 'story', 'of', 'aladdin', 'or', 'the', 'wonderful', 'lamp', 'sindbad', 'the', 'sailor', 'robinson', 'crusoe', 'robinson', 'crusoe', 'is', 'shipwrecked', 'daniel', 'defoe', 'alone', 'on', 'a', 'desolate', 'island', 'daniel', 'defoe', 'the', 'building', 'of', 'the', 'boat', 'daniel', 'defoe', 'finds', 'the', 'print', 'of', 'a', 'man', 's', 'foot', 'on', 'the', 'sand', 'daniel', 'defoe', 'friday', 'rescued', 'from', 'the', 'cannibals', 'daniel', 'defoe', 'robinson', 'crusoe', 'rescued', 'daniel', 'defoe', 'gulliver', 's', 'travels', 'gulliver', 'is', 'shipwrecked', 'and', 'swims', 'for', 'his', 'life', 'jonathan', 'swift', 'gulliver', 'at', 'the', 'court', 'of', 'lilliput', 'jonathan', 'swift', 'gulliver', 'captures', 'fifty', 'of', 'the', 'enemy', 's', 'ships', 'jonathan', 'swift', 'gulliver', 'leaves', 'lilliput', 'jonathan', 'swift', 'gulliver', 'in', 'the', 'land', 'of', 'the', 'giant

In [6]:
# Now let's count the words! 

# We first build a dictionary with word types and their frequencies
word_frequencies = {}
# loop through the sentences
for sentence in wordsonly:
    # loop through the words of the current sentence
    for string in sentence:
        # add 1 to the count of that word 
        # (and if it's not already in the dictionary, first add it with count 0)
        word_frequencies[string] = word_frequencies.get(string, 0) + 1

# PRINTING
# print the whole thing
# print(word_frequencies)

# print just 30 (the 30 first, alphabetically)
word_types = sorted(word_frequencies.keys())
for word in word_types[:30]:
    # formatted printing: f"blah blah {python code} blah blah"
    print(f"{word}: {word_frequencies[word]}")

a: 2621
abandon: 1
abandoned: 3
abashed: 1
abate: 3
abated: 5
abating: 1
abbot: 1
abdalla: 12
abel: 2
abhor: 1
abhorred: 1
abhorrence: 1
abide: 4
ability: 4
able: 50
ablest: 1
aboard: 3
abode: 4
abortive: 1
abound: 2
abounded: 1
about: 262
above: 54
abraham: 5
abreast: 1
abroad: 11
absence: 6
absent: 7
absolute: 1


In [7]:
# Let's look at some of the most frequent words.
# We construct a list of words ordered from most frequent to least frequent
sorted_keys = sorted(word_frequencies, key = word_frequencies.get, reverse = True)
print("These are the 10 most frequent words: ", sorted_keys[:10])
print("These are the 10 most infrequent words: ", sorted_keys[-10:])

These are the 10 most frequent words:  ['the', 'and', 'to', 'of', 'i', 'a', 'in', 'was', 'he', 'that']
These are the 10 most infrequent words:  ['scanning', 'items', 'portions', 'header', 'trailer', 'reprinted', 'sales', 'hardware', 'product', 'ver']


In [8]:
# It is generally better to restrict models to words with a minimum frequency.
# We define a frequency threshold of 10 and keep only the words that occur more than 10 times:
minfreq = 10
target_freqs = {word: freq for word, freq in word_frequencies.items() if freq > minfreq}
# Now target_freqs is a dictionary with all the words we are interested in (we call them targets), 
# and their frequencies.

# It will also be useful to have the list of targets:
targets = sorted(target_freqs.keys())
# And the vocabulary size
vocabulary_size = len(target_freqs)
print(vocabulary_size)

1363


*Suggested Take-home exercise*: also remove the most frequent words ("stopwords").
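
One possible starting point for this exercise is sketched below on made-up toy data. It assumes that "stopwords" simply means the `n_stopwords` most frequent word types; both that definition and the names `toy_frequencies`, `toy_sentences` and `n_stopwords` are ours. A curated stopword list would be an equally reasonable choice.

```python
# Made-up toy frequencies, standing in for word_frequencies
toy_frequencies = {"the": 50, "and": 30, "lamp": 5, "of": 28, "sailor": 3}

# Assumption: treat the n most frequent word types as stopwords
n_stopwords = 3
by_frequency = sorted(toy_frequencies, key=toy_frequencies.get, reverse=True)
stopwords = set(by_frequency[:n_stopwords])

# Made-up tokenized sentences, standing in for wordsonly
toy_sentences = [["the", "lamp", "of", "the", "sailor"], ["and", "the", "lamp"]]

# Keep only the words that are not stopwords
filtered = [[w for w in sentence if w not in stopwords] for sentence in toy_sentences]
print(filtered)
```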

## Building the Distributional Semantic Model

In [9]:
# Here we write the functions that we use to build the Distributional Semantic Model.

# This function is used to build the co-occurrence matrix
def calculate_cooccurrences(tokenized, target_words, window):
    """
    This is a "docstring", a comment explaining how a function works.
    Given a list of tokenised sentences, a list of target words, and a window size,
        builds a co-occurrence matrix.
    @param tokenized: list of list of strings: the data.
    @param target_words: sorted list of word types in the data we're looking for.
    @param window: int: how far before and after the word to look for neighbouring words. 
                    e.g. 2 means the two words before and the two words after.
    @return: a tuple (matrix, word2index):
            a numpy matrix of dim len(target_words) x len(target_words),
            and a dict from each word to its row/column index.
            Each row/column refers to a word, in the same order as target_words, 
            and the entries are their co-occurrence frequencies.
    """
    vocabulary_size = len(target_words)
    # Word to index dictionary
    word2index = {w: i for i, w in enumerate(target_words)}
    
    # initialise the matrix with 0's
    matrix = np.zeros([vocabulary_size, vocabulary_size]) 
    # loop through the sentences
    for sentence in tokenized:
        # loop through the words, getting their indices in the sentence from enumerate
        for position, word in enumerate(sentence):   
            # for every word within the window
            for j in range(max(position - window, 0), min(position + window + 1, len(sentence))):
                context=sentence[j]  # the current context word
                # skip the word itself and any words not in target_words
                # (word2index has exactly the target words as keys, and dict lookup is fast)
                if j != position and word in word2index and context in word2index: 
                    # increment the relevant matrix cell by 1
                    matrix[word2index[word]][word2index[context]]+=1
    # this makes the function, when applied to its arguments, "equal to" the matrix we just built,
    # and the word2index dict
    return matrix, word2index


def get_cooccurrence(word1, word2, counts, word2index):
    """
    Returns the co-occurrence counts between two words,
        given a co-occurrence matrix and a word to index dictionary.
    """
    try:
        w1_index = word2index[word1]
    except KeyError:
        print(f"{word1} is not in the target vocabulary")
        return None  # stop here: without w1_index we cannot look up the count
    try:
        return counts[w1_index][word2index[word2]]
    except KeyError:
        print(f"{word2} is not in the target vocabulary")
        return None



In [10]:
# We now compute the co-occurrences in our tokenized text and store them for later use
# Warning: this can be slow for a large data file.
count_matrix, w2i = calculate_cooccurrences(wordsonly, targets, 2)
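
To see what a window of 2 means in the call above, here is a minimal sketch on a made-up sentence. It uses the same `max`/`min` index arithmetic as `calculate_cooccurrences`, but only collects the context words instead of counting them; the sentence and the `contexts` dictionary are ours, for illustration only.

```python
# Made-up example sentence (each word occurs once, so a dict is enough here)
sentence = ["the", "porter", "saw", "a", "lamp"]
window = 2

contexts = {}
for position, word in enumerate(sentence):
    # the window is cut off at the start and at the end of the sentence
    start = max(position - window, 0)
    end = min(position + window + 1, len(sentence))
    # every word in the window except the word itself
    contexts[word] = [sentence[j] for j in range(start, end) if j != position]

for word, context in contexts.items():
    print(word, "->", context)
```

Words near the sentence boundaries get fewer context words, because the window never crosses into the neighbouring sentence.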

In [11]:
# Let's have a look at some co-occurrences:

word_pairs = [("horse", "house"), ("next", "morning"), ("the", "and")]

for w1, w2 in word_pairs:
    print(f"'{w1}' co-occurs with '{w2}' {get_cooccurrence(w1, w2, count_matrix, w2i)} times")


'horse' co-occurs with 'house' 0.0 times
'next' co-occurs with 'morning' 26.0 times
'the' co-occurs with 'and' 1354.0 times


In [12]:
# Exercise: what do you think this will print?
print(get_cooccurrence("horse", "house", count_matrix, w2i) == get_cooccurrence("house", "horse", count_matrix, w2i))

True


Here we can apply any transformation to this matrix of counts (e.g. Pointwise Mutual Information). We leave this as an exercise!
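
As a possible starting point for that exercise, the sketch below applies positive PMI (PPMI) to a small made-up count matrix. PMI compares each observed count with the count we would expect if the two words were independent, $\log_2 \frac{P(w, c)}{P(w)P(c)}$, and PPMI replaces negative values with 0. The function name `ppmi` and the toy matrix are ours, and the sketch assumes the matrix contains at least one non-zero count.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI transform of a co-occurrence count matrix (a sketch)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row_sums = counts.sum(axis=1, keepdims=True)  # shape (n, 1)
    col_sums = counts.sum(axis=0, keepdims=True)  # shape (1, n)
    # Expected counts if words and contexts were independent
    expected = row_sums @ col_sums / total
    result = np.zeros_like(counts)
    # Only take the log where the count is non-zero (log of 0 is undefined)
    nonzero = counts > 0
    result[nonzero] = np.log2(counts[nonzero] / expected[nonzero])
    # Replace negative PMI values with 0
    return np.maximum(result, 0)

# Made-up toy counts for three words
toy_counts = np.array([[0, 2, 0],
                       [2, 0, 1],
                       [0, 1, 0]])
print(ppmi(toy_counts))
```

With the notebook's own data, the same function could be tried as `ppmi(count_matrix)` before computing similarities.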

## Distances (semantic similarity)

In [None]:
# We can compute a matrix with the cosine similarity between every pair of words
# using a built-in function from sklearn.metrics.pairwise:
similarities = cosine_similarity(count_matrix)
# print(similarities)


In [13]:
def get_similarities(word1, word2, similarities, word2index):
    """
    Compute the cosine similarity between word1 and word2 from pre-computed similarity matrix.
    @param word1: string.
    @param word2: string.
    @param similarities: numpy matrix of pre-computed cosine similarities between words.
    @param word2index: dict from string to int, to find out which row and column belongs to which word.
    @return float: the cosine similarity, or a string if one of the words isn't in the dict.
    """
    try:
        w1_index = word2index[word1]
    except KeyError:
        return f"'{word1}' is not in the target vocabulary"
    try:
        return similarities[w1_index][word2index[word2]]
    except KeyError:
        return f"'{word2}' is not in the target vocabulary"


In [14]:
word_pairs = [("eat", "drink"), ("lamp","door"), ("lamp","drink"), ("eat", "eats"), ("fell", "fall")]

#Let's look at the similarities between these words

for w1, w2 in word_pairs:
    print(f"\nThe cosine similarity of '{w1}' and '{w2}' is \t{get_similarities(w1, w2, similarities, w2i)}")
    print(f"Freq of '{w1}': {word_frequencies.get(w1, 0)}")
    print(f"Freq of '{w2}': {word_frequencies.get(w2, 0)}")    

# Try other examples! You can add them to/replace them in word_pairs


The cosine similarity of 'eat' and 'drink' is 	0.7851863215093005
Freq of 'eat': 36
Freq of 'drink': 18

The cosine similarity of 'lamp' and 'door' is 	0.8969077260497536
Freq of 'lamp': 65
Freq of 'door': 101

The cosine similarity of 'lamp' and 'drink' is 	0.33051021754269977
Freq of 'lamp': 65
Freq of 'drink': 18

The cosine similarity of 'eat' and 'eats' is 	'eats' is not in the target vocabulary
Freq of 'eat': 36
Freq of 'eats': 1

The cosine similarity of 'fell' and 'fall' is 	0.6901739457542575
Freq of 'fell': 80
Freq of 'fall': 36


Now that you know how to compute the similarities, you can run many kinds of analyses.
For example, you can correlate word similarities predicted by this model with human similarity judgements, find the closest neighbours of a word, compare similarities between models of books from different periods, find out if children learn similar words earlier, etc. 
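
One of these analyses, finding the closest neighbours of a word, can be sketched as follows. The helper `nearest_neighbours` and the toy data are ours, for illustration only. To keep the example self-contained, the toy similarity matrix is computed with plain numpy rather than with `cosine_similarity`.

```python
import numpy as np

def nearest_neighbours(word, similarities, word2index, k=3):
    """
    Hypothetical helper: return the k words most similar to `word`
    as (word, similarity) pairs, most similar first.
    """
    index2word = {i: w for w, i in word2index.items()}
    target = word2index[word]
    row = similarities[target]
    # argsort sorts in ascending order, so reverse it to get the most similar first
    ranked = np.argsort(row)[::-1]
    # skip the word itself, then keep the first k
    neighbours = [(index2word[i], float(row[i])) for i in ranked if i != target]
    return neighbours[:k]

# Made-up toy data: three words and their co-occurrence counts
toy_w2i = {"cat": 0, "dog": 1, "lamp": 2}
toy_counts = np.array([[4.0, 3.0, 0.0],
                       [3.0, 4.0, 1.0],
                       [0.0, 1.0, 5.0]])

# Cosine similarity with plain numpy: normalise each row, then take dot products
unit_rows = toy_counts / np.linalg.norm(toy_counts, axis=1, keepdims=True)
toy_similarities = unit_rows @ unit_rows.T

toy_neighbours = nearest_neighbours("cat", toy_similarities, toy_w2i, k=2)
print(toy_neighbours)
```

With the notebook's own variables, a call such as `nearest_neighbours("lamp", similarities, w2i)` should work the same way, as long as the word is in the target vocabulary.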

## Exercise
The file thomas.txt contains child-directed speech from the Thomas corpus in CHILDES.

Re-use the previous code to build a Distributional Semantic Model for this data. You should be able to use the functions as they are, but you'll probably want to copy, paste, and edit the rest of the code below.
