# Term Frequency Models

This notebook demonstrates how to use `Corpus` objects to train and view a term frequency (TF) or "word count" model, a term frequency-inverse document frequency (tf-idf) model, and a Latent Semantic Analysis (LSA) model, and how to view topics in the LDA models trained through the topic explorer.

To run the notebook, use the menu `Cell -> Run All` or use the play button for one cell at a time.

In [None]:
# First we load the vsm module and import your corpus
from vsm import *
from corpus import *

## LDA Models

The LDA models are automatically imported into the dictionary `lda_v`, so that `lda_v[k]` is the model trained with `k` topics.

In [None]:
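# topic_range lists the available numbers of topics
print(topic_range)
# Select the model with the last (typically largest) number of topics
k_val = topic_range[-1]
v = lda_v[k_val]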


In [None]:
v.topics()

# Sentence Similarity Module

In [None]:
from vsm.extensions.ldasentences import sim_sent_sent

In [None]:
# First, a helper to find the sentence IDs
def find_sentence_ids(phrase):
    ids = []
    for sent_id, sentence in enumerate(c.sentences):
        if phrase in sentence:
            ids.append(sent_id)
            
    return ids
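
To see what the helper does without loading a corpus, here is a minimal sketch on a toy list of strings. The names `toy_sentences` and `find_ids_in` are hypothetical stand-ins, and it is an assumption that the entries of `c.sentences` behave like plain strings under the `in` test.

```python
# Hypothetical stand-in for c.sentences; real corpus entries may differ in type.
toy_sentences = [
    "Reason has previously become a topic.",
    "Space and time are forms of intuition.",
    "Reason demands unity.",
]

def find_ids_in(sentences, phrase):
    """Return the index of every sentence that contains `phrase`."""
    return [sent_id for sent_id, sentence in enumerate(sentences)
            if phrase in sentence]

print(find_ids_in(toy_sentences, "Reason"))     # [0, 2]
print(find_ids_in(toy_sentences, "intuition"))  # [1]
print(find_ids_in(toy_sentences, "reason"))     # [] because matching is case-sensitive
```

The match is an exact, case-sensitive substring test, so the search phrase has to be copied exactly as it appears in the text, punctuation included.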

In [None]:
# We can use this to search for a particular phrase
find_sentence_ids('Reason has previously become')

In [None]:
# This allows us to view the sentence
print(c.sentences[0])

In [None]:
# This is how you get similarity for a particular sentence
# If the distance is 0, it means that the topic distribution is identical for this model.
tok_sents, orig_sents, sim_sents = sim_sent_sent(v, 0)
sim_sents
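
To illustrate why a distance of 0 means identical topic distributions, here is a small pure-Python sketch. It assumes a Jensen-Shannon-style distance; which measure `sim_sent_sent` actually uses is not stated in this notebook, and `kl_divergence` and `js_distance` are hypothetical helpers, not part of `vsm`.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between two distributions."""
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))

# Toy topic distributions over three topics (illustrative values only)
same_a = [0.7, 0.2, 0.1]
same_b = [0.7, 0.2, 0.1]
other = [0.1, 0.2, 0.7]

print(js_distance(same_a, same_b))  # identical distributions give 0.0
print(js_distance(same_a, other))   # different distributions give a value above 0
```

Under this assumption, smaller values indicate sentences whose topic mixtures are closer in the selected model, and the result depends on the model `v` that was passed in.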

In [None]:
# Find the whole passage for Axioms of Intuition: Part 1
# First we find the first sentence of the proof
print(find_sentence_ids('All phenomena contain, as regards their form, an intuition in space and time, which lies a priori at the foundation of all without exception.'))

# Then the last sentence of the first section
print(find_sentence_ids('But in this case, no a priori synthetical cognition of them could be possible, consequently not through pure conceptions of space and the science which determines these conceptions, that is to say, geometry, would itself be impossible.'))

In [None]:
# First print all these sentences; the slice end is exclusive,
# so 1697 is needed to include sentence 1696
print(c.sentences[1664:1697])

In [None]:
c.view_contexts('sentence', as_strings=True)[1664:1697]

In [None]:
# Another way to get a list of all the numbers from 1664 through 1696 is the range() function
list(range(1664,1697))
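
Both slices and `range()` exclude their end value, which is why 1697 appears when the last wanted sentence is 1696. The sketch below checks this on a toy list; `first_id`, `last_id`, and `toy_items` are hypothetical names used only for illustration.

```python
first_id, last_id = 1664, 1696

# range() stops before its second argument, so add 1 to include last_id
ids = list(range(first_id, last_id + 1))
print(len(ids))  # 33
print(ids[0])    # 1664
print(ids[-1])   # 1696

# A slice follows the same convention: items[a:b] covers indices a through b - 1
toy_items = list(range(2000))  # stand-in for a list of sentences
print(toy_items[first_id:last_id + 1] == ids)  # True
```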

In [None]:
# Look for sentences similar to a range of sentences
tok_sents, orig_sents, sim_sents = sim_sent_sent(v, list(range(1664, 1697)), print_len=5, min_sent_len=5)
sim_sents