# General Corpus: Comparing the Number of Topics Assigned to the Topic Model
A number of researchers have noted that one limitation of LDA is that it cannot identify how many topics are in a corpus, leaving this decision to the human user (Yau et al., 2014; Suominen and Toivanen, 2016). Indeed, there is no way to identify the "correct" number of topics in advance of building the topic model (Carter et al., 2016). If the user specifies too few topics, the topics will be too general to be useful for exploratory analysis or data retrieval. By contrast, if the user specifies too many topics, the topics will be too specific or duplicated to be of use, and the model as a whole becomes unwieldy to interpret. Therefore, most users experiment with the number of topics and make qualitative evaluations about which is most useful (Chen et al., 2016). Ultimately, the right number of topics depends on how the model is going to be used (Carter et al., 2016). As a result, the ratio of documents (n) in a corpus to topics (k) extracted from it ranges widely. To provide a few examples:
* Topic modeling a mental health internet support group: n = 131,000; k = 25 (Bradley et al., 2016)
* Topic modeling corporate sustainability reports: n = 9,514; k = 70 (Szekely and vom Brocke, 2017)
* Topic modeling literature about adolescent substance abuse and depression: n = 17,723; k = 20 (Chen et al., 2016)
* Topic modeling legal documents: n = 7,476; k = 10 and 50 (Carter et al., 2016)
* Topic modeling scientific journal articles: n = 89,730; k = 50 (Yau et al., 2014)

Here, I evaluate three topic models built on the general corpus, each specified with a different number of topics to extract: 100, 250, and 500. These models are all based on the "general corpus"; that is to say, the features of these models may be any part of speech (although the dominant words in most topics are nouns). Additionally, the topic models discussed here have alpha set to `gensim`'s default: `alpha='symmetric'`.

Each model will be compared to the others in terms of its ability to:
1. produce coherent topics
2. cluster the documents in the corpus
3. retrieve documents in the corpus which match new, unseen documents.

## Imports

In [1]:
from gensim import corpora, models, similarities
import pyLDAvis.gensim
import json
import spacy
#from operator import itemgetter
#import pandas as pd

## Set up

In [2]:
path = '../general_corpus/'
with open('../data/doc2metadata.json', encoding='utf8', mode='r') as f:
    doc2metadata = json.load(f)

## Topic Coherence
An important measure of a topic model is whether or not the topics it produces are coherent, whether that coherence is semantic or contextual. By semantic coherence, I mean that the most dominant terms in the topic all fall in the same semantic range; by contextual coherence I mean that given the context of the document (i.e. a scholarly journal of biblical literature) the dominant words in the topic are related. While this measure is subjective, it often proves more meaningful than traditional metrics of topic coherence (Chang, et al., 2009).

For the purposes of this project, a topic is considered coherent if at least 9 out of its 10 most prominent terms fall within the same semantic or contextual range.
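
The 9-out-of-10 rule can be written down as a small helper. This is only a sketch: the judgments themselves are made by a human reader, and the function names and the example values below are hypothetical, not taken from the evaluation in this notebook.

```python
def is_coherent(terms_in_range, threshold=9):
    """Return True if enough of a topic's top 10 terms were judged related."""
    return terms_in_range >= threshold


def coherence_rate(judgments, threshold=9):
    """Percentage of topics that meet the threshold.

    `judgments` maps a topic label to the number of its 10 most prominent
    terms that a reader placed in one semantic or contextual range.
    """
    coherent = [t for t, n in judgments.items() if is_coherent(n, threshold)]
    return 100 * len(coherent) / len(judgments)


# Hypothetical judgments for three topics (illustrative values only).
example_judgments = {'topic_a': 10, 'topic_b': 3, 'topic_c': 9}
print(coherence_rate(example_judgments))
```

Keeping the threshold as a parameter makes it easy to see how sensitive the coherence counts are to a stricter or looser rule.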

NB: be sure to include a note on reading the visualization

### Topic Coherence: General 100 Topics, alpha = 'symmetric'

In [3]:
dictionary = corpora.Dictionary.load(path + 'general_corpus.dict')
corpus = corpora.MmCorpus(path + 'general_corpus.mm')
general_100 = models.ldamodel.LdaModel.load(path + 'general_100.model')
general100_viz = pyLDAvis.gensim.prepare(general_100, corpus, dictionary)
pyLDAvis.display(general100_viz)
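
As a complement to the visualization, the most prominent terms of a topic can also be listed as plain text. The sketch below assumes `model` is a `gensim` `LdaModel`, whose `show_topic` method returns `(word, probability)` pairs; the helper names are my own. Topic numbering in the visualization may differ from the model's internal ids, so any ids passed in should be checked against the model.

```python
def top_terms(model, topic_id, topn=10):
    """Return the `topn` highest-weighted words for one topic."""
    # show_topic yields (word, probability) pairs, highest weight first.
    return [word for word, _ in model.show_topic(topic_id, topn=topn)]


def print_topics(model, topic_ids, topn=10):
    """Print one line of top terms for each requested topic id."""
    for topic_id in topic_ids:
        terms = ', '.join(top_terms(model, topic_id, topn))
        print('topic', topic_id, ':', terms)
```

With the model loaded above, a call such as `print_topics(general_100, [26, 35])` would print one line per topic.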

The `general_100` model produced 53 coherent topics out of 100. Here are a few examples of coherent topics to which I have also attached a possible label:
* **topic 26 (family and gender)**: woman, child, male, wife, marriage, female, mother, sexual, husband, family
* **topic 35 (rabbinic literature)**: rabbinic, judaism, rabbi, synagogue, midrash, ben, talmud, mishnah, period, neusner
* **topic 46 (narrative criticism)**: narrative, reader, character, speech, tobit, audience, narrator, discourse, voice, reading
* **topic 52 (hellenistic philosophy)**: philo, greek, hellenistic, philosophy, plato, soul, pseudo, stoic, plutarch, philosophical
* **topic 59 (miracles and spirituality)**: spirit, power, miracle, holy, demon, divine, satan, magic, evil, spiritual
* **topic 72 (scripture and canon)**: scripture, torah, oral, canon, canonical, writing, authority, bible, sanders, process
There may be outliers in the most prominent words for some of these topics (such as "tobit" in topic 46) but on the whole these words fit together.

This model has also distinguished between similar categories. The following three topics could all be associated with the discipline of textual criticism but the model provided some nuance:
* **topic 31 (text criticism, general)**: reading, manuscript, textual, variant, group, mss, codex, witness, western, type
* **topic 55 (text criticism, hebrew bible)**: lxx, mt, reading, line, translator, samaritan, textual, varaint, version, greek
* **topic 65 (papyrology)**: codex, copy, line, ms, fragment, manuscript, papyrus, library, column, cent

The topics mentioned above are examples, not an exhaustive list, of the coherent topics produced by this model. However, there were a number of topics which lack coherence and may be dismissed as "junk topics":
* **topic 41**: shall, original, hand, son, father, connection, yahweh, stand, phrase, house
* **topic 92**: hfl, stud, vaguely, sketchy, tribution, suited, evade, pression, overshadow, cogently

Finally, there are a few topics which may be coherent, but are too general to distinguish between documents. Topic 3 is an example:
* **topic 3**: theology, ot, nt, deal, reader, treatment, exegesis, method, criticism, contribution

It should be noted that if topic 3 appeared in a topic model which represented an interdisciplinary corpus, it may in fact be a useful topic. However, since all of these documents pertain to the discipline of biblical studies, this topic is not informative.


### Topic Coherence: General 250 Topics, alpha = 'symmetric'

In [9]:
dictionary = corpora.Dictionary.load(path + 'general_corpus.dict')
corpus = corpora.MmCorpus(path + 'general_corpus.mm')
general_250 = models.ldamodel.LdaModel.load(path + 'alpha_symmetric/general250.model')
general_250_viz = pyLDAvis.gensim.prepare(general_250, corpus, dictionary)
pyLDAvis.display(general_250_viz)

FileNotFoundError: [Errno 2] No such file or directory: '../general_corpus/alpha_symmetric/general250.model'

The `general_250` model provided 108 coherent topics out of 250 topics (43% of the topics are coherent). Many of the topics provided by the `general_100` model are also present in the `general_250` model (with slight variation), such as:
* **topic 14 (text criticism, general)**: reading, manuscript, textual, variant, codex, group, mss, witness, greek, western
* **topic 19 (narrative criticism)**: narrative, story, reader, character, literary, narrator, scene, episode, reading, tale
Topic 14 provides a good illustration of contextual rather than semantic coherence. The word "western" is not in the same semantic range as reading, manuscript, and textual. However, in the context of text criticism, 'western' refers to a family of manuscripts. "Western" is therefore contextually coherent with the other terms. It is also worth noting that "reading" appears in each list. The model has done a good job of recognizing (via patterns of co-occurrence) that reading has different nuances. In topic 14 "reading" refers to the text preserved by a manuscript whereas in topic 19 "reading" refers to the act of drawing meaning out from a series of words.

There are also a few topics produced by the `general_250` model which add a helpful level of nuance such as topic 96:
* **topic 96 (documentary hypothesis)**: p, j, pentateuch, priestly, wellhausen, e, yahweh, je, narrative, document

This topic may seem incoherent, but a biblical scholar will recognize each word (including the abbreviations and the name "Wellhausen") as key terminology for a theory called the documentary hypothesis (that the Pentateuch was written at four key stages by four separate authors called "J", "E", "D" and "P").

This model also provided nuance to the family and gender topic from the `general_100` model by distinguishing between words which belong more to the semantic range of family and those which belong more to the semantic range of gender:
* **topic 63 (family)**: child, wife, marriage, mother, woman, husband, daughter, sexual, birth, sister
* **topic 92 (gender)**: woman, male, female, gender, feminist, women, role, sexual, sex, feminine

### Topic Coherence: General 500 Topics, alpha = 'symmetric'

In [9]:
dictionary = corpora.Dictionary.load(path + 'general_corpus.dict')
corpus = corpora.MmCorpus(path + 'general_corpus.mm')
general_500 = models.ldamodel.LdaModel.load(path + 'alpha_symmetric/general500.model')
general_500_viz = pyLDAvis.gensim.prepare(general_500, corpus, dictionary)
pyLDAvis.display(general_500_viz)

Finally, the `general_500` model produced 135 coherent topics out of 500 topics (27% of the topics are coherent). Several of the topics have persisted throughout each model, such as narrative criticism:
* **topic 42 (narrative criticism)**: narrative, story, reader, character, narrator, literary, episode, event, scene, tell

This model has also added nuance with new topics that expand upon existing ones. For example, there is a new topic, generally related to textual criticism, which could be labeled "scribal activity":
* **topic 147 (scribal activity)**: omit, add, omission, gloss, insert, scribe, addition, change, editor, marginal

In addition, the `general_500` model has produced coherent topics which are not found in the previous models:
* **topic 121 (church fathers)**: origen, justin, clement, irenaeus, christian, tertullian, hippolytus, augustine, patristic, alexandria
* **topic 171 (myth)**: myth, mythology, mythological, mythic, pattern, combat, story, ritual, reality, figure
* **topic 179 (liturgy)**: hymn, doxology, liturgical, praise, liturgy, thanksgiving, andrew, amen, hymnic, didactic

### Topic Coherence: Discussion
A certain amount of trial and error is necessary when figuring out how many topics to specify for a topic model. At the end of the day, it is necessary to find a balance between generality and specificity for the model (Szekely and vom Brocke, 2017). In the case of topic modeling the *JBL*, assigning fewer topics to the model led to a higher percentage of coherent topics, but fewer coherent topics overall: for the `general_100` model 53/100 topics were coherent (53%), for the `general_250` model 108/250 topics were coherent (43%), and for the `general_500` model 135/500 topics were coherent (27%). Of the coherent topics produced by each model, those produced by the `general_100` model were far more general than those produced by the `general_500` model. I would suggest that if one wants to use this model to explore what the *JBL* is about, the `general_500` model does a better job of giving those details, despite the fact that its ratio of meaningful topics to the total number of topics is lower than that of the other models.
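
The trade-off between the count and the share of coherent topics can be tabulated directly from the figures reported in this section. This is a minimal sketch; the dictionary and function names are my own.

```python
# (coherent topics, total topics) as reported in this section
results = {
    'general_100': (53, 100),
    'general_250': (108, 250),
    'general_500': (135, 500),
}


def coherence_table(results):
    """Return {model name: (coherent count, percent coherent, rounded)}."""
    return {name: (coherent, round(100 * coherent / total))
            for name, (coherent, total) in results.items()}


for name, (count, pct) in coherence_table(results).items():
    print(f'{name}: {count} coherent topics ({pct}%)')
```

Read side by side, the count of coherent topics rises with k while the percentage falls.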

I would also suggest, given the specific topics mentioned above, that it is helpful to have someone with subject expertise evaluate the topics. A lay reader might have dismissed topic 96 from the `general_250` model as a junk topic. However, it turned out to be a very specific topic in the field of biblical studies.

## Clustering
One of the applications for this topic model is clustering articles from the *JBL*. One way of evaluating the model is to ask, from a big-picture perspective, how much of the corpus it is able to cluster. Here I have defined a function which identifies what percentage of the corpus the model is able to cluster into one topic, into multiple topics, or into no topics at all. The threshold I have decided to use is that a topic must have a probability of at least 20% in a document for the document to belong to that topic. That is to say, 20% of the words used in a document must come from a particular topic if that document is to be clustered with other documents in that topic.

In [11]:
def doc_assignment_test(corpus, model):
    docs_with_1_topic = 0
    docs_with_multiple_topics = 0
    docs_with_no_topics = 0
    total_docs = 0
    for doc in corpus:
        topics = model.get_document_topics(doc, minimum_probability=0.20)
        total_docs += 1
        if len(topics) == 1:
            docs_with_1_topic += 1
        elif len(topics) > 1:
            docs_with_multiple_topics += 1
        else:
            docs_with_no_topics += 1
    print('Corpus assigned to a single topic:', (docs_with_1_topic / total_docs) * 100, '%')
    print('Corpus assigned to multiple topics:', (docs_with_multiple_topics / total_docs) * 100, '%')
    print('corpus assigned to no topics:', (docs_with_no_topics / total_docs) * 100, '%')
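
If the percentages need to be compared across models rather than only printed, a variant that returns them may be more convenient. The following is a sketch of that idea, not the function used for the results in this notebook; the name `doc_assignment_rates` and the `threshold` parameter are my own, and `model` is only assumed to expose `gensim`'s `get_document_topics(doc, minimum_probability=...)`.

```python
def doc_assignment_rates(corpus, model, threshold=0.20):
    """Return the percentage of documents with one, several, or no topics."""
    counts = {'single': 0, 'multiple': 0, 'none': 0}
    total_docs = 0
    for doc in corpus:
        # Keep only topics whose probability meets the threshold.
        topics = model.get_document_topics(doc, minimum_probability=threshold)
        total_docs += 1
        if len(topics) == 1:
            counts['single'] += 1
        elif len(topics) > 1:
            counts['multiple'] += 1
        else:
            counts['none'] += 1
    if total_docs == 0:
        raise ValueError('corpus is empty')
    return {label: 100 * n / total_docs for label, n in counts.items()}
```

Returning a dictionary also makes it straightforward to try other thresholds, for example `doc_assignment_rates(corpus, general_100, threshold=0.30)`.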

### Clustering: General 100 Topics, alpha = 'symmetric'

In [10]:
doc_assignment_test(corpus, general_100)

Corpus assigned to a single topic: 57.53102267864784 %
Corpus assigned to multiple topics: 26.839965768078734 %
corpus assigned to no topics: 15.629011553273427 %


The general 100 topic model was able to assign 57.53% of the corpus to one topic and 26.84% of the corpus to multiple topics. This totals 84.37% of the corpus being associated with at least one topic with a probability of 20% or more. Put another way, for nearly 8,000 documents in this corpus, at least 20% of the document's content is accounted for by one of the topics in this model. However, this leaves 15.63% of the corpus without significant association with any topic.

### Clustering: General 250 Topics, alpha = 'symmetric'

In [13]:
doc_assignment_test(corpus, general_250)

Corpus assigned to a single topic: 54.07573812580231 %
Corpus assigned to multiple topics: 13.425331621737268 %
corpus assigned to no topics: 32.49893025246042 %


The general 250 topic model was able to assign 54.08% of the corpus to one topic and 13.43% of the corpus to multiple topics. This totals 67.50% of the corpus being associated with at least one topic with a probability of 20% or more. Put another way, for just over 6,000 documents in this corpus, at least 20% of the document's content is accounted for by one of the topics in this model. However, this leaves 32.50% of the corpus without significant association with any topic.

### Clustering: General 500 Topics, alpha = 'symmetric'

In [12]:
doc_assignment_test(corpus, general_500)

Corpus assigned to a single topic: 47.5609756097561 %
Corpus assigned to multiple topics: 8.194266153187847 %
corpus assigned to no topics: 44.244758237056054 %
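
The three runs can be put side by side by adding the single-topic and multiple-topic shares for each model. The sketch below simply copies the percentages printed above; the variable and function names are my own.

```python
# Percentages copied from the three outputs above: (single, multiple, none)
reported = {
    'general_100': (57.53102267864784, 26.839965768078734, 15.629011553273427),
    'general_250': (54.07573812580231, 13.425331621737268, 32.49893025246042),
    'general_500': (47.5609756097561, 8.194266153187847, 44.244758237056054),
}


def at_least_one_topic(single, multiple):
    """Share of the corpus assigned to one or more topics, rounded to 2 places."""
    return round(single + multiple, 2)


for name, (single, multiple, none) in reported.items():
    print(name, at_least_one_topic(single, multiple), round(none, 2))
```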


## Information Retrieval
A final way of evaluating this model is to see if it is able to provide useful information retrieval. To evaluate the model in this way, it will be given abstracts from a few different articles from more recent issues of *JBL* which it has not yet seen. Then, it will be evaluated on whether or not it is able to return similar articles from the corpus it has seen. Abstracts are taken from the following articles:
* Greene, N. E. (2017). Creation, destruction, and a Psalmist's plea: rethinking the poetic structure of Psalm 74. *Journal Of Biblical Literature*, 136 (1), 85-101. doi:10.15699/jbl.1361.2017.156672
* Hollenback, G. M. (2017). Who Is Doing What to Whom Revisited: Another Look at Leviticus 18:22 and 20:13. *Journal Of Biblical Literature*, 136 (3), 529-537. doi:10.15699/jbl.1363.2017.161166
* Dinkler, M. B. (2017). Building Character on the Road to Emmaus: Lukan Characterization in Contemporary Literary Perspective. *Journal Of Biblical Literature*, 136(3), 687-706. doi:10.15699/jbl.1363.2017.292918

In [18]:
index_100 = similarities.MatrixSimilarity(general_100[corpus])  # build index for similarity queries
index_250 = similarities.MatrixSimilarity(general_250[corpus])  # build index for similarity queries
index_500 = similarities.MatrixSimilarity(general_500[corpus])  # build index for similarity queries

In [61]:
def retrieval_test(new_doc, lda, index):
    new_bow = dictionary.doc2bow(new_doc)  # change new document to bag of words representation
    new_vec = lda[new_bow]  # change new bag of words to a vector
    index.num_best = 5  # set index to generate 5 best results
    matches = (index[new_vec])
    for match in matches:
        score = str(match[1])
        key = 'doc_' + str(match[0])
        article_dict = doc2metadata[key]
        author = article_dict['author']
        title = article_dict['title']
        year = article_dict['pub_year']
        print(author + ' "' + title + '" ' + year + '\n\tsimilarity score -> ' + score + '\n')
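
The similarity scores reported by `retrieval_test` come from `MatrixSimilarity`, which ranks documents by the cosine similarity between topic vectors. To make the score easier to interpret, here is a minimal pure-Python sketch of that measure on sparse `(topic_id, weight)` vectors. It is an illustration only, not `gensim`'s implementation, and the two example vectors are hypothetical.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors of (topic_id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    # Only topics present in both vectors contribute to the dot product.
    dot = sum(weight * b.get(topic, 0.0) for topic, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical topic distributions for a new abstract and a corpus document.
query = [(3, 0.6), (14, 0.4)]
doc = [(3, 0.5), (42, 0.5)]
print(cosine_similarity(query, doc))
```

A score near 1 means the two documents draw on the same topics in similar proportions; a score of 0 means they share no topics at all.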

In [38]:
nlp = spacy.load('en')
stop_words = spacy.en.STOPWORDS

def get_lemmas(text):
    doc = nlp(text)
    tokens = [token for token in doc]
    lemmas = [token.lemma_ for token in tokens if token.is_alpha]
    lemmas = [lemma for lemma in lemmas if lemma not in stop_words]
    return lemmas

#### Load documents

In [39]:
with open('../data/doc2metadata.json', encoding='utf8', mode='r') as f:
    doc2metadata = json.load(f)

In [40]:
with open('../abstracts/greene.txt', encoding='utf8', mode='r') as f:
    text = f.read()
    greene = get_lemmas(text)

In [41]:
with open('../abstracts/hollenback.txt', encoding='utf8', mode='r') as f:
    text = f.read()
    hollenback = get_lemmas(text)

In [42]:
with open('../abstracts/dinkler.txt', encoding='utf8', mode='r') as f:
    text = f.read()
    dinkler = get_lemmas(text)

### Finding articles similar to Greene, N. E. (2017). Creation, destruction, and a Psalmist's plea: rethinking the poetic structure of Psalm 74

#### General 100 Topics, alpha = 'symmetric'

In [71]:
retrieval_test(greene, general_100, index_100)

Gillingham, S. "review of the message of the psalter: an eschatological programme in the book of psalms" 1999
	similarity score -> 0.8859902620315552

Miller, Patrick D. "review of die komposition des psalters: ein formgeschichtlicher ansatz" 1997
	similarity score -> 0.8664625883102417

Malchow, Bruce V. "review of psalm 102 im kontext des vierten psalmenbuches" 1997
	similarity score -> 0.8636935353279114

Limburg, James "review of  jahwe wird kommen, zu herrschen über die erde: ps 90-110 als komposition " 1997
	similarity score -> 0.8316084146499634

Jerome F. D. Creach "review of the songs of ascents (psalms 120-134): their place in israelite history and religion" 1999
	similarity score -> 0.8286547660827637



#### General 250 Topics, alpha = 'symmetric'

In [63]:
retrieval_test(greene, general_250, index_250)

Miller, Patrick D. "review of die komposition des psalters: ein formgeschichtlicher ansatz" 1997
	similarity score -> 0.8384199142456055

Malchow, Bruce V. "review of psalm 102 im kontext des vierten psalmenbuches" 1997
	similarity score -> 0.8353039026260376

Creach, Jerome F. "review of between sheol and temple: motif structure and function in the i-psalms" 1997
	similarity score -> 0.8258504867553711

Jerome F. D. Creach "review of the songs of ascents (psalms 120-134): their place in israelite history and religion" 1999
	similarity score -> 0.823907732963562

Mays, James L. "review of the psalms of the sons of korah" 1985
	similarity score -> 0.8172439336776733



#### General 500 Topics, alpha = 'symmetric'

In [64]:
retrieval_test(greene, general_500, index_500)

Balentine, Samuel E. "review of the cry to god in the old testament" 1990
	similarity score -> 0.7842912077903748

Collins, John J. "review of the self as symbolic space: constructing identity and community at qumran" 2005
	similarity score -> 0.7370615601539612

Conway, Colleen M. "review of the shining garment of the text: gendered readings of john's prologue" 2001
	similarity score -> 0.7347920536994934

Wassell, Blake E. "“fishers of humans,” the contemporary theory of metaphor, and conceptual blending theory" 2014
	similarity score -> 0.7299591302871704

Ahn, John "psalm 137: complex communal laments" 2008
	similarity score -> 0.726301372051239



### Finding articles similar to Hollenback, G. M. (2017). Who Is Doing What to Whom Revisited: Another Look at Leviticus 18:22 and 20:13.

#### General 100 Topics, alpha = 'symmetric'

In [65]:
retrieval_test(hollenback, general_100, index_100)

Bird, Phyllis A. "review of frauen im alten israel: eine begriffsgeschichtliche und sozialrechtliche studie zur stellung der frau im alten testament" 1993
	similarity score -> 0.841264009475708

Sharp, Carolyn J. "review of gender in the book of jeremiah: a feminist-literary reading" 2000
	similarity score -> 0.8202147483825684

Brawley, Robert L. "review of homoeroticism in the biblical world: a historical perspective" 2001
	similarity score -> 0.766456127166748

Pressler, Carolyn J. "review of on gendering texts: female and male voices in the hebrew bible" 1996
	similarity score -> 0.7661805152893066

Adams, Karin "metaphor and dissonance: a reinterpretation of hosea 4:13-14" 2008
	similarity score -> 0.7642098665237427



#### General 250 Topics, alpha = 'symmetric'

In [66]:
retrieval_test(hollenback, general_250, index_250)

Walsh, Jerome T. "leviticus 18:22 and 20:13: who is doing what to whom?" 2001
	similarity score -> 0.6948369145393372

Thompson, Richard P. "review of  verschwiegene jüngerinnen-vergessene zeuginnen: gebrochene konzepte im lukasevangelium " 2000
	similarity score -> 0.587742805480957

Sharp, Carolyn J. "review of gender in the book of jeremiah: a feminist-literary reading" 2000
	similarity score -> 0.5831398367881775

author not listed "review of ristabilire la giustizia: procedure, vocabulario, orientamenti" 1988
	similarity score -> 0.5541554093360901

Kraemer, Ross S. "review of in memory of her: a feminist theological reconstruction of christian origins" 1985
	similarity score -> 0.5489803552627563



#### General 500 Topics, alpha = 'symmetric'

In [67]:
retrieval_test(hollenback, general_500, index_500)

Corley, Kathleen E. "review of the double message: patterns of gender in luke-acts" 1996
	similarity score -> 0.7296997904777527

Sakenfeld, Katharine Doob "review of till the heart sings: a biblical theology of manhood and womanhood" 1987
	similarity score -> 0.692420482635498

Bird, Phyllis A. "review of frauen im alten israel: eine begriffsgeschichtliche und sozialrechtliche studie zur stellung der frau im alten testament" 1993
	similarity score -> 0.6910187602043152

De George, Susan G. "review of women in the ministry of jesus: a study of jesus' attitudes to women and their roles as reflected in his earthly life" 1986
	similarity score -> 0.6893328428268433

Karris, Robert J. "review of choosing the better part? women in the gospel of luke" 1998
	similarity score -> 0.6882470846176147



### Finding articles similar to Dinkler, M. B. (2017). Building Character on the Road to Emmaus: Lukan Characterization in Contemporary Literary Perspective.

#### General 100 Topics, alpha = 'symmetric'

In [68]:
retrieval_test(dinkler, general_100, index_100)

Landry, David "review of jesus the intercessor: prayer and christology in luke-acts" 1995
	similarity score -> 0.9301128387451172

Tyson, Joseph B. "review of the lukan voice: confusion and irony in the gospel of luke" 1988
	similarity score -> 0.9137783646583557

Darr, John A. "review of host, guest, enemy and friend: portraits of the pharisees in luke and acts" 1993
	similarity score -> 0.8895664811134338

Cousland, J. R. C. "review of the controversy stories in the gospel of matthew: their redaction, form and relevance for the relationship between the matthean community and formative judaism" 2003
	similarity score -> 0.8889796137809753

Kysar, Robert "review of jesus and the samaritan woman: a speech act reading of john 4:1-42" 1993
	similarity score -> 0.8823515772819519



#### General 250 Topics, alpha = 'symmetric'

In [69]:
retrieval_test(dinkler, general_250, index_250)

Darr, John A. "review of host, guest, enemy and friend: portraits of the pharisees in luke and acts" 1993
	similarity score -> 0.8710132837295532

Campbell, William Sanger "review of "but it is not so among you": echoes of power in mark 10:32-45" 2005
	similarity score -> 0.8654212951660156

Driggers, Ira Brent "review of reading mark: engaging the gospel" 2004
	similarity score -> 0.8547635078430176

Gladson, Jerry A. "review of reliable characters in the primary history: profiles of moses, joshua, elijah and elisha" 1998
	similarity score -> 0.8543339371681213

Hobbs, T. R. "review of of methods, monarchs, and meanings: a sociorhetorical approach to exegesis" 1998
	similarity score -> 0.8247799277305603



#### General 500 Topics, alpha = 'symmetric'

In [70]:
retrieval_test(dinkler, general_500, index_500)

Gladson, Jerry A. "review of reliable characters in the primary history: profiles of moses, joshua, elijah and elisha" 1998
	similarity score -> 0.9376718401908875

Nolland, John "review of  den anfang hören: leserorientierte evangelienexegese am beispiel von matthäus 1-2 " 2000
	similarity score -> 0.9328693747520447

Kissling, Paul J. "review of the prostitute and the prophet: hosea's marriage in literary-theoretical perspective" 1998
	similarity score -> 0.9214377999305725

Darr, John A. "review of host, guest, enemy and friend: portraits of the pharisees in luke and acts" 1993
	similarity score -> 0.913260817527771

McLaughlin, John L. "review of history and ideology in the old testament prophetic literature: a semiotic approach to the reconstruction of the proclamation of the historical prophets" 1998
	similarity score -> 0.902553141117096

