## Project 1, Milestone 1

Given a corpus of text, we project the documents in the corpus to a dense, low-dimensional space and compute the cosine similarity between all pairs of documents. We then inspect the distribution of similarities and choose an appropriate similarity threshold. Next, we construct a graph of documents, where a pair of documents has an edge between them if their similarity exceeds the threshold. Finally, we save the adjacency matrix of this graph as a serialized file for the next milestone.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import os
import spacy

from scipy.sparse import coo_matrix, save_npz

%matplotlib inline

In [None]:
DATA_DIR = "../../data/project-1"

ABSTRACT_FILE = os.path.join(DATA_DIR, "stat-abstracts.tsv")

ABS_VEC_FILE = os.path.join(DATA_DIR, "stat-abstract-vectors.tsv")
DOCIDS_LIST = os.path.join(DATA_DIR, "stat-av-docids.txt")

SIM_MATRIX_FILE = os.path.join(DATA_DIR, "av-simmatrix.npy")
ADJ_MATRIX_FILE = os.path.join(DATA_DIR, "av-adjmatrix.npz")

### 1. Load the SpaCy language model

We will use SpaCy's medium English language model, `en_core_web_md`. SpaCy does not ship any language models with the library itself. If the next cell fails with an error about a missing or invalid model, download the model from the command line using the following command. See [SpaCy Models and Languages](https://spacy.io/usage/models) for more information.

`python -m spacy download en_core_web_md`

Once the language model has been downloaded successfully, restart the Jupyter kernel and rerun this notebook.

In [None]:
nlp = spacy.load("en_core_web_md")

### 2. Extract Document Vectors

We are provided with a TSV file containing our corpus of papers (indicated by `ABSTRACT_FILE`). Each line contains the `docID`, the paper `title`, a set of semicolon-separated `categories`, and the text of the `abstract`.

In the `join_title_text` function below, prepend the title as the first sentence of the abstract. You can do this by adding a period to the title and joining it to the beginning of the abstract text.

In [None]:
def join_title_text(title, text):
    joined_text = None
    ## your code goes here
    ## Hint: join the title + ". " + abstract_text

    ## end of your code goes here
    return joined_text


example_joined_text = join_title_text("this is a title", "this is an abstract.")
example_joined_text
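
As a point of comparison for the exercise above, here is a minimal sketch of one way to join the two fields. The function name `join_title_and_abstract` is hypothetical, and the check for existing end punctuation is an added assumption that goes beyond the hint in the cell; it only avoids producing a doubled period when a title already ends a sentence.

```python
def join_title_and_abstract(title: str, abstract: str) -> str:
    """Prepend the title to the abstract as its own sentence (illustrative helper)."""
    title = title.strip()
    # Assumption: only add a period when the title does not already end a sentence.
    if title and title[-1] not in ".!?":
        title += "."
    return (title + " " + abstract.strip()).strip()


print(join_title_and_abstract("this is a title", "this is an abstract."))
```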

The `vectorize_text` function takes a block of text and a reference to the language model and returns a 300-dimensional vector. Complete the `vectorize_text` function below. If you are unsure of which function to use, refer to the [SpaCy Linguistic Features](https://spacy.io/usage/linguistic-features#vectors-similarity) page.

In [None]:
def vectorize_text(text, nlp):
    vec = None
    ## your code goes here
    ## Hint: passing the text to the language model returns a document object,
    ## which has a .vector attribute that will produce the vector

    ## end of your code goes here
    return vec


example_vector = vectorize_text(example_joined_text, nlp)
example_vector.shape

We will now read each line of the input file indicated by `ABSTRACT_FILE`, generate vectors using the (`title`, `abstract`) pairs, and write out the `docID` and generated `vector` as a string in the file indicated by `ABS_VEC_FILE`.

**NOTE: This is a time-consuming operation. If you would rather skip it, you can download `stat-abstract-vectors.tsv.backup` from the code repository for this liveProject and use it in place of the output of this step. To do this, go to your data folder and run the following commands:**

```
wget http://download.location/.../stat-abstract-vectors.tsv.backup
mv stat-abstract-vectors.tsv.backup stat-abstract-vectors.tsv
```

In [None]:
num_lines = 0
fabs = open(ABSTRACT_FILE, "r")
fchk = open(ABS_VEC_FILE, "w")
for line in fabs:
    if num_lines % 10000 == 0:
        print("{:d} docs vectorized".format(num_lines))
    doc_id, title, categories, abs_text = line.strip().split('\t')
    vec_str = None
    ## your code goes here
    ## Generate the vector from the title and abstract, and stringify the vector elements
    ## into vec_str. Each element should be represented in floating point or exponential 
    ## notation to 3 decimal places, and elements should be joined using commas (,).
    ## Hint: consider using the {:.3e} format string to convert each float element to string

    ## end of your code goes here
    fchk.write("{:s}\t{:s}\n".format(doc_id, vec_str))
    num_lines += 1

print("{:d} docs vectorized, COMPLETE".format(num_lines))
fchk.close()
fabs.close()
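
The serialization format described in the cell above can be illustrated on a tiny vector. This is a sketch only: `vector_to_string` is a hypothetical helper name, and the three-element vector stands in for a real 300-dimensional one. It also shows that parsing the string back recovers the values only up to the precision kept by `{:.3e}`.

```python
import numpy as np


def vector_to_string(vec, fmt="{:.3e}"):
    """Render each element in exponential notation and join them with commas."""
    # float() keeps the formatting independent of the NumPy scalar type.
    return ",".join(fmt.format(float(x)) for x in vec)


vec = np.array([0.5, -2.25, 0.000123456])
vec_str = vector_to_string(vec)
print(vec_str)  # 5.000e-01,-2.250e+00,1.235e-04

# Round trip: the parsed values match the originals to about 3 decimal digits
# of the mantissa, which is the precision the format string retains.
parsed = np.array([float(tok) for tok in vec_str.split(",")])
print(np.allclose(parsed, vec, rtol=1e-3))
```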

### 3. Construct a dense document matrix

Use the file of vectors you just generated (indicated by `ABS_VEC_FILE`) and create a list of document IDs (`docids`) and a list of vectors (`vecs`). Each element of the list `vecs` must be a numpy array of size (300,).

The next cell displays the size of the `docids` and `vecs` list. Verify that they are identical. Also verify that the elements of the list `vecs` are numpy vectors of size (300,).

In [None]:
docids, vecs = [], []
with open(ABS_VEC_FILE, "r") as fav:
    num_recs = 0
    for line in fav:
        if num_recs % 10000 == 0:
            print("{:d} vectors read".format(num_recs))
        doc_id, vec_str = line.strip().split('\t')
        ## your code goes here
        ## Hint: split the vec_str by "," then cast each str element into float
        ##       Also remember to wrap the list with np.array() to produce a numpy
        ##       vector
        
        ## end of your code goes here
        num_recs += 1

print("{:d} vectors read, COMPLETE".format(num_recs))

In [None]:
print("number of docIDs:", len(docids))
print("number of vectors:", len(vecs))
print("shape of vector in vecs:", vecs[0].shape)
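
For the parsing step, a small sketch of reading one line of the vector file back into a NumPy array and checking its shape. The helper name `parse_vector_line` and the constant `EXPECTED_DIM` are hypothetical; the line layout (docID, a tab, then comma-separated floats) is the one written in section 2.

```python
import numpy as np

EXPECTED_DIM = 300  # vector size stated in the text for this corpus


def parse_vector_line(line, expected_dim=EXPECTED_DIM):
    """Split 'docID<TAB>v1,v2,...' into the docID and a NumPy vector."""
    doc_id, vec_str = line.rstrip("\n").split("\t")
    vec = np.array([float(tok) for tok in vec_str.split(",")])
    # Fail early on truncated or malformed lines instead of building a ragged matrix.
    if vec.shape != (expected_dim,):
        raise ValueError(
            "{:s}: expected {:d} elements, got {:d}".format(doc_id, expected_dim, vec.size)
        )
    return doc_id, vec


# Synthetic line standing in for one record of the vector file.
line = "doc-1\t" + ",".join(["1.000e-01"] * EXPECTED_DIM) + "\n"
doc_id, vec = parse_vector_line(line)
print(doc_id, vec.shape)
```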

We now convert the list of (300,) vectors into a matrix of documents. There are 50426 documents in the corpus, so verify that the shape of the matrix is (50426, 300).

In [None]:
X = np.array(vecs)
X.shape

### 4. Construct a document similarity matrix

We use cosine similarity as our similarity measure. The cosine similarity between a pair of document vectors $d_1$ and $d_2$ is given by:

$$cosim(d_1, d_2) = \frac{d_1 \cdot d_2}{{\lVert d_1 \rVert}_2 {\lVert d_2 \rVert}_2}$$

For all pairs of documents in the document matrix $X$, whose rows are the document vectors $x_i$, this becomes:

$$S_{ij} = \frac{(X \cdot X^T)_{ij}}{{\lVert x_i \rVert}_2 {\lVert x_j \rVert}_2}$$

The denominator is a constant only if every document vector has the same norm. Here we make that simplifying assumption and work with the unnormalized product, so the entries of $S$ are dot products and are not confined to $[-1, 1]$:

$$S \propto X \cdot X^T$$
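
For comparison, the sketch below applies the row-wise normalization explicitly on a toy matrix, so the result is a true cosine similarity bounded by 1. The function name `cosine_similarity_matrix` and the `eps` guard against zero-length vectors are assumptions for illustration; the notebook itself continues with the unnormalized product.

```python
import numpy as np


def cosine_similarity_matrix(X, eps=1e-12):
    """Cosine similarity between all pairs of rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Scale every row to unit length; eps avoids division by zero for all-zero rows.
    X_unit = X / np.maximum(norms, eps)
    return X_unit @ X_unit.T


# Rows 0 and 1 point in the same direction but have different lengths.
X_demo = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])

S_cos = cosine_similarity_matrix(X_demo)
S_dot = X_demo @ X_demo.T

print(S_cos)  # rows 0 and 1 have similarity 1, row 2 is orthogonal to both
print(S_dot)  # the raw dot products depend on vector length as well as direction
```

On this toy input the two matrices rank the pairs differently only through vector length, which is why a threshold chosen on one scale does not transfer to the other.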



**NOTE: This is a time-consuming operation. If you would rather skip it, download the serialized version of the similarity matrix from the code repository for this liveProject and load it into the variable `S` by uncommenting the commented code in the cell after next (the one containing `S = np.load(SIM_MATRIX_FILE)`).**

```
wget http://download.location/.../av-simmatrix.npy
```

In [None]:
S = np.dot(X, X.T)
S.shape

In [None]:
# S = np.load(SIM_MATRIX_FILE)
# S.shape

### 5. Determine Similarity Threshold

We sample around 1000 elements from the similarity matrix and plot a histogram to get an idea of the distribution of cosine similarity scores. 

We want to draw edges only between documents with relatively high similarity. Based on the histogram, a good threshold seems to be 9.5.

In [None]:
row_indices = np.random.randint(0, S.shape[0], 1000)
col_indices = np.random.randint(0, S.shape[1], 1000)
samples = []
for row, col in zip(row_indices, col_indices):
    samples.append(S[row, col])

plt.hist(samples)
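
As a numeric complement to reading the threshold off the histogram, the sketch below draws the sample with NumPy fancy indexing and prints a few upper quantiles. It runs on a small random stand-in matrix (`S_demo`), not on the real similarity matrix, and the quantile levels are arbitrary choices for illustration, not values the notebook prescribes.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the real similarity matrix: dot products of random vectors.
X_demo = rng.normal(size=(200, 16))
S_demo = X_demo @ X_demo.T

# Sample 1000 (row, col) pairs in one indexing operation instead of a Python loop.
rows = rng.integers(0, S_demo.shape[0], size=1000)
cols = rng.integers(0, S_demo.shape[1], size=1000)
samples = S_demo[rows, cols]

# Upper quantiles give candidate cut-offs to compare against the histogram.
levels = [0.5, 0.9, 0.99]
quantiles = np.quantile(samples, levels)
for level, value in zip(levels, quantiles):
    print("{:.0%} of sampled scores are below {:.3f}".format(level, value))
```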

### 6. Create Adjacency Matrix

We can now create an adjacency matrix `A` from the similarity matrix `S`. An adjacency matrix is a square matrix of the same size as the similarity matrix, i.e. an (N, N) matrix where N is the number of documents. An element `A[i, j]` is 1 if there is high similarity between $doc_i$ and $doc_j$, i.e. similarity above the threshold.

Also remember to set the diagonal elements of the adjacency matrix to 0. For similarity matrices, the highest values are on the diagonal, since a document is most similar to itself. However, that would translate to self-loops in a graph, which we don't care about.

In [None]:
threshold = 9.5

A = np.zeros(S.shape)
ones_indices = S >= threshold
A[ones_indices] = 1

np.fill_diagonal(A, 0)
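
The same thresholding logic can be checked by hand on a 3x3 example. In this sketch, `adjacency_from_similarity` is a hypothetical helper and the scores are made up; it builds the matrix as `int8` directly from the boolean mask, which is a design choice that avoids allocating a full float array first.

```python
import numpy as np


def adjacency_from_similarity(S, threshold):
    """Return a 0/1 adjacency matrix with no self-loops."""
    # The boolean mask S >= threshold becomes 1 where an edge exists, 0 elsewhere.
    A = (S >= threshold).astype(np.int8)
    # A document is always similar to itself; drop those self-loops.
    np.fill_diagonal(A, 0)
    return A


# Made-up similarity scores for three documents.
S_demo = np.array([
    [12.0, 9.7, 3.1],
    [9.7, 11.0, 9.5],
    [3.1, 9.5, 10.5],
])

A_demo = adjacency_from_similarity(S_demo, threshold=9.5)
print(A_demo)
# [[0 1 0]
#  [1 0 1]
#  [0 1 0]]
```

Document 1 ends up connected to both neighbours, while documents 0 and 2 are not connected to each other because their score of 3.1 is below the threshold.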

### 7. Save Adjacency Matrix and DocIDs list

Our adjacency matrix `A` is now sparse, so you can either save it directly using `np.save()` or convert it to a SciPy sparse COO matrix using `coo_matrix(A.astype(np.int8))` and then save it using `save_npz()`. Saving it as a sparse matrix is recommended: it takes about the same time to write, but results in a much smaller file on disk.

In [None]:
# np.save(ADJ_MATRIX_FILE, A)
save_npz(ADJ_MATRIX_FILE, coo_matrix(A.astype(np.int8)))
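
To confirm that the sparse file can be read back, here is a small round-trip sketch using `load_npz` from `scipy.sparse`. The file name `demo-adjmatrix.npz` and the temporary directory are assumptions made so the example does not touch the project data folder.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import coo_matrix, load_npz, save_npz

# Tiny adjacency matrix standing in for the real one.
A_demo = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=np.int8)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-adjmatrix.npz")
    # Only the non-zero entries and their coordinates are stored.
    save_npz(path, coo_matrix(A_demo))
    A_loaded = load_npz(path)

# Convert back to a dense array to compare against the original.
restored = A_loaded.toarray()
print("stored non-zero entries:", A_loaded.nnz)
print("round trip matches:", np.array_equal(restored, A_demo))
```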

In [None]:
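# Write one "docID<TAB>row index" line per document so that rows of the
# adjacency matrix can be mapped back to document IDs later.
with open(DOCIDS_LIST, "w") as fdocs:
    for i, doc_id in enumerate(docids):
        fdocs.write("{:s}\t{:d}\n".format(doc_id, i))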