# compute word co-occurrence matrix

This notebook constructs the word co-occurrence matrix from a bag-of-sentences file. In particular, given a file with one sentence on each line, we create a matrix **counts** whose (i, j) entry is the number of times words i and j of our vocabulary co-occur within the same sentence.
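
To make the definition concrete, here is a minimal sketch of sentence-level co-occurrence counting on a toy corpus. The three sentences are made up for illustration, and plain whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (hypothetical); whitespace splitting stands in for a real tokenizer
sentences = ["the court ruled", "the court agreed", "the jury agreed"]

pair_counts = Counter()
for s in sentences:
    words = s.split()
    # sort each pair so that (a, b) and (b, a) are counted as the same pair
    pair_counts.update(tuple(sorted(p)) for p in combinations(words, 2))

print(pair_counts[("court", "the")])   # 2
print(pair_counts[("agreed", "the")])  # 2
print(pair_counts[("jury", "ruled")])  # 0 (never in the same sentence)
```

Each unordered pair is stored once under its sorted key, so lookups should also use the sorted order.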


**code details**

The code uses the nltk package for text processing (primarily tokenizing sentences into words). nltk is both a software package and a (free) textbook; you can read more here: http://www.nltk.org/book/

The word co-occurrence matrix is a very large, sparse matrix (i.e. most words do not co-occur with most other words). Sparse matrix data structures store only the nonzero entries, which makes working with such matrices tractable on a laptop; we use scipy's sparse matrix library (https://docs.scipy.org/doc/scipy/reference/sparse.html).
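
As a rough illustration of why the sparse format matters, the sketch below fills a mostly empty matrix entry by entry and compares its storage with a dense array of the same shape. The vocabulary size `n` and the two entries are arbitrary values chosen for the example.

```python
import numpy as np
from scipy.sparse import lil_matrix

n = 10000  # hypothetical vocabulary size

# lil format is convenient for assigning entries one at a time
counts = lil_matrix((n, n))
counts[0, 1] = 3
counts[2, 5] = 1

# symmetrize, then convert to csr for efficient arithmetic and row slicing
counts = (counts + counts.T).tocsr()

print(counts.nnz)    # 4 stored entries
print(counts[1, 0])  # 3.0

# bytes needed for a dense float64 array vs. the three csr arrays
dense_bytes = n * n * np.dtype(np.float64).itemsize
sparse_bytes = counts.data.nbytes + counts.indices.nbytes + counts.indptr.nbytes
print(dense_bytes, sparse_bytes)
```

The dense version would need 800 MB, while the csr arrays here take only a few tens of kilobytes, most of which is the row pointer array.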

In [1]:
import numpy as np
from collections import Counter
from itertools import combinations, chain, islice

from nltk.tokenize import word_tokenize
from scipy.sparse import csr_matrix, lil_matrix


# import local code files
import sys, os
sys.path.append(os.getcwd() + '/code/')

from save import save_vocabulary, save_matrix

`sentences_file` should be a file with one sentence on each line. Note that the sentences file has already been preprocessed (e.g. documents tokenized into sentences, words lowercased, etc.).
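
For readers starting from raw text, the following sketch shows one way a document could be turned into the one-sentence-per-line, lowercased form described above. It is not the preprocessing used to build this file: the helper `to_sentence_lines` is hypothetical, and its regex split is a naive stand-in that breaks on abbreviations such as "U.S."; a proper sentence tokenizer (for example nltk's `sent_tokenize`) would be a better choice in practice.

```python
import re

def to_sentence_lines(document):
    """Naive stand-in for real preprocessing: split after ., ! or ? and lowercase."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p.lower() for p in parts if p]

doc = "The Court ruled. The jury agreed! Was it fair?"
lines = to_sentence_lines(doc)

# each element would become one line of the sentences file
print("\n".join(lines))
```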

In [2]:
# you should modify this path on your own computer
sentences_file = '/Users/iaincarmichael/data/word_embed/scotus/sentences/scotus_sentences_1e4.txt'

In [7]:
def sentences2counts(sentences_file, N=1000):
    """
    Counts the number of times each pair of words co-occurs in the same sentence.
    Reads the bag-of-sentences file N lines at a time to reduce memory usage.
    
    Parameters
    ----------
    sentences_file: path to bag of sentences file
    N: how many lines to read in at a time
    
    Returns
    -------
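    counts: symmetric scipy.sparse.csr_matrix of sentence-level co-occurrence counts
    vocab: list of words; vocab[i] is the word for row/column i of counts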
    
    """
    
    pair_counts = Counter()
    with open(sentences_file) as f:
        
        # go through the file N lines at a time
        while True:
            sentences = list(islice(f, N)) # read in N lines

            if sentences:
                # tokenize sentences into words
                sentences_word_tok = [word_tokenize(s) for s in sentences]

                # Get a list of all of the combinations
                extended = [tuple(combinations(s, 2)) for s in sentences_word_tok]
                extended = chain(*extended)

                # Sort the combinations so that A,B and B,A are treated the same
                extended = [tuple(sorted(d)) for d in extended]

                # count the combinations
                pair_counts.update(extended)
    
            else: # stop while loop if we are at the end of the file
                break
                
                
    # get vocabulary: every word that appears in at least one pair
    # (zip objects are not subscriptable in Python 3, so flatten the keys instead)
    vocab = sorted(set(chain.from_iterable(pair_counts.keys())))
    w2i = {w: i for i, w in enumerate(vocab)}
    
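    # construct counts as a lil_matrix (cheap to fill entry by entry),
    # then symmetrize and return as a csr_matrix
    counts = lil_matrix((len(vocab), len(vocab)))
    for (w1, w2), c in pair_counts.items():
        counts[w2i[w1], w2i[w2]] = c

    # each unordered pair was stored only once, as (w1, w2); adding the
    # transpose fills in (w2, w1). Note this doubles diagonal entries
    # (a word repeated within one sentence).
    counts = (counts + counts.T).tocsr()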
    
    return counts, vocab
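
The chunked reading used in `sentences2counts` can be isolated as a small generator, sketched below. The name `read_in_chunks` is hypothetical, and an in-memory `StringIO` stands in for an open sentences file so the example runs without any data on disk.

```python
from io import StringIO
from itertools import islice

def read_in_chunks(f, n):
    """Yield lists of at most n lines from an open file-like object."""
    while True:
        chunk = list(islice(f, n))  # consumes the next n lines of f
        if not chunk:               # empty list means end of file
            return
        yield chunk

# stand-in for an open sentences file with five lines
f = StringIO("a b\nc d\ne f\ng h\ni j\n")
sizes = [len(chunk) for chunk in read_in_chunks(f, 2)]
print(sizes)  # [2, 2, 1]
```

Only one chunk of raw lines is held in memory at a time; the pair counter itself still grows with the number of distinct word pairs.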

In [None]:
%%time
counts, vocab = sentences2counts(sentences_file, 1000)
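
Once `counts` and `vocab` are available, a word-to-index dictionary makes it easy to query the matrix. The sketch below uses a tiny hand-made matrix and vocabulary as stand-ins for the real outputs; the helper names `cooccurrence` and `top_cooccurring` are hypothetical and not part of the notebook's code.

```python
from scipy.sparse import csr_matrix

# Toy stand-ins for the `vocab` and `counts` returned by sentences2counts
vocab = ["agreed", "court", "the"]
counts = csr_matrix([[0, 1, 2],
                     [1, 0, 3],
                     [2, 3, 0]])

# map each word to its row/column index
w2i = {w: i for i, w in enumerate(vocab)}

def cooccurrence(w1, w2):
    """Co-occurrence count of two words (0 if either is out of vocabulary)."""
    if w1 not in w2i or w2 not in w2i:
        return 0
    return counts[w2i[w1], w2i[w2]]

def top_cooccurring(word, k=2):
    """The k words that co-occur most often with `word`, with their counts."""
    row = counts[w2i[word]].toarray().ravel()
    order = row.argsort()[::-1][:k]
    return [(vocab[j], int(row[j])) for j in order if row[j] > 0]

print(cooccurrence("court", "the"))  # 3
print(top_cooccurring("the"))        # [('court', 3), ('agreed', 2)]
```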

**save the word co-occurrence counts and the vocabulary**

In [None]:
# save the vocabulary and the counts matrix
save_vocabulary('data/vocab_1e4.txt', vocab)
save_matrix('data/scotus_counts_1e4', counts)