# Topic Models
__Topic modeling__ is an area of NLP dedicated to uncovering hidden topics within a body of text.
<br>
<br>
A common technique is to down-weight words that appear in most documents and up-weight words that are frequent in one document but rare in the rest, a weighting scheme known as __term frequency-inverse document frequency (tf-idf)__. The libraries __gensim__ and __sklearn__ have modules to handle __tf-idf__.
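<br>
<br>
As a rough illustration of the weighting, the sketch below computes tf-idf by hand with the textbook formula tf × log(N / df) on three made-up sentences. The names `toy_docs` and `tf_idf` are hypothetical, and __sklearn__'s `TfidfVectorizer` uses a smoothed idf and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy documents, made up for illustration (not the Sherlock Holmes corpus)
toy_docs = [
    "holmes find photograph",
    "holmes find father",
    "holmes see pool",
]

def tf_idf(docs):
    """Return one {word: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

toy_scores = tf_idf(toy_docs)
for doc_scores in toy_scores:
    print({word: round(score, 3) for word, score in doc_scores.items()})
```

Here "holmes" appears in every document and scores 0, while a word unique to one document, such as "photograph", scores highest in that document.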
<br>
<br>
__Latent Dirichlet allocation (LDA)__ is a statistical model that takes your documents and determines which words keep popping up together in the same contexts. The library __sklearn__ implements it as the LatentDirichletAllocation class.
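<br>
<br>
To make the idea concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling in plain Python. This is an assumption made for illustration: __sklearn__'s `LatentDirichletAllocation` uses variational Bayes rather than Gibbs sampling, and the function name `lda_gibbs`, the toy documents, and the `alpha` and `beta` values are all hypothetical.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of token lists.

    Returns (vocab, doc_topic, topic_word), where the last two are count tables.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_vocab for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start from a random topic assignment for every token
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            z = rng.randrange(n_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[word]] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                z = assignments[d][i]
                # Remove this token's current assignment from the counts
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Weight each topic by how much this document uses it
                # and how much the topic uses this word
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return vocab, doc_topic, topic_word

# Toy documents, made up for illustration
toy_docs = [
    "photograph majesty king photograph".split(),
    "king majesty photograph king".split(),
    "pool father son pool".split(),
    "father son pool father".split(),
]
vocab, doc_topic, topic_word = lda_gibbs(toy_docs, n_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda v: counts[v], reverse=True)[:3]
    print("Topic #{}: {}".format(k + 1, " ".join(vocab[v] for v in top)))
```

With vocabularies as cleanly separated as these, the sampler tends to group the words from the first two documents in one topic and the words from the last two in the other, though the topic numbering is arbitrary.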
<br>
<br>
__Word2vec__ represents words as vectors so that words used in similar contexts sit closer together in the vector space. A word-to-vector mapping of this kind is known as a word embedding.
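<br>
<br>
Closeness between word vectors is commonly measured with cosine similarity. The sketch below uses hand-made three-dimensional vectors purely for illustration; they are not trained Word2vec embeddings, and the names `toy_vectors` and `cosine_similarity` are hypothetical.

```python
import math

# Made-up vectors for illustration only (a trained model would learn these)
toy_vectors = {
    "father": [0.9, 0.8, 0.1],
    "son": [0.8, 0.9, 0.2],
    "photograph": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(toy_vectors["father"], toy_vectors["son"]))
print(cosine_similarity(toy_vectors["father"], toy_vectors["photograph"]))
```

The first pair scores close to 1 and the second much lower, which is the sense in which similarly used words end up "closer together".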

In [2]:
from sherlock_holmes import bohemia_ch1, bohemia_ch2, bohemia_ch3, boscombe_ch1, boscombe_ch2, boscombe_ch3
corpus = [bohemia_ch1, bohemia_ch2, bohemia_ch3, boscombe_ch1, boscombe_ch2, boscombe_ch3]
print(bohemia_ch1[:300])


To Sherlock Holmes she is always THE woman. I have seldom heard
him mention her under any other name. In his eyes she eclipses
and predominates the whole of her sex. It was not that he felt
any emotion akin to love for Irene Adler. All emotions, and that
one particularly, were abhorrent to his cold


In [3]:
# Preprocessing
# - Removing punctuation
# - Tokenization = Breaking text into individual words
# - Lemmatization = Reducing words to their root forms ('are' becomes 'be')
# - Stopword removal
from preprocessing import preprocess_text
preprocessed_corpus = [preprocess_text(chapter) for chapter in corpus]
print(preprocessed_corpus)

k follow father idea front hundred yard pool hear cry cooee usual signal father hurry forward find stand pool appear much surprise see ask rather roughly conversation ensue lead high word almost blow father man violent temper see passion become ungovernable leave return towards hatherley farm go 150 yard however hear hideous outcry behind cause run back find father expire upon grind head terribly injure drop gun hold arm almost instantly expire kneel beside minute make way mr turner lodge keeper house near ask assistance saw one near father return idea come injury popular man somewhat cold forbid manner far know active enemy know nothing far matter coroner father make statement die witness mumble word could catch allusion rat coroner understand witness convey mean think delirious coroner point upon father final quarrel witness prefer answer coroner afraid must press witness really impossible tell assure nothing sad tragedy follow coroner court decide need point refusal answer prejudice
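
The `preprocessing` module is not shown here, so the following is only a hypothetical sketch of what a function like `preprocess_text` might do, using the standard library alone. The stop word set and the tiny lemma lookup are placeholders invented for illustration; a real pipeline would typically rely on a library lemmatizer and a much larger stop word list.

```python
import re

# Placeholder resources, invented for this sketch
TOY_STOPWORDS = {"the", "a", "to", "be", "she", "i", "of", "in", "his", "her", "him"}
TOY_LEMMAS = {"are": "be", "is": "be", "heard": "hear", "eyes": "eye", "eclipses": "eclipse"}

def preprocess_text_sketch(text):
    """Lowercase, strip punctuation, tokenize, lemmatize, and drop stop words."""
    no_punctuation = re.sub(r"[^\w\s]", "", text.lower())
    tokens = no_punctuation.split()
    # Look each token up in the toy lemma table, keeping it as-is when absent
    lemmas = [TOY_LEMMAS.get(token, token) for token in tokens]
    return " ".join(word for word in lemmas if word not in TOY_STOPWORDS)

print(preprocess_text_sketch("To Sherlock Holmes she is always THE woman."))
# prints: sherlock holmes always woman
```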

In [7]:
# Stopword removal

# stop_list holds words that don't tell you much about the topic.
# If uninformative words still dominate the bag of words LDA topics, add them here and run the code again.
stop_list = ["say", "see", "holmes", "shall", "man", "upon", "know", "quite", "one", "well", "could", "would", "take", "may", "think", "come", "go", "little", "must", "look"]

def filter_out_stop_words(corpus):
  no_stops_corpus = []
  for chapter in corpus:
    no_stops_chapter = " ".join([word for word in chapter.split(" ") if word not in stop_list])
    no_stops_corpus.append(no_stops_chapter)
  return no_stops_corpus

filtered_for_stops = filter_out_stop_words(preprocessed_corpus)

# creating the bag of words model
from sklearn.feature_extraction.text import CountVectorizer
bag_of_words_creator = CountVectorizer()
bag_of_words = bag_of_words_creator.fit_transform(filtered_for_stops)

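# creating the tf-idf model
# min_df=0.2 drops terms found in fewer than 20% of documents (with 6 chapters, terms found in only one chapter)
# Note: this is fit on preprocessed_corpus, not filtered_for_stops, so stop_list words such as "holmes" and "say" can still show up in the tf-idf topics
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_creator = TfidfVectorizer(min_df=0.2)
tfidf = tfidf_creator.fit_transform(preprocessed_corpus)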

# creating the bag of words LDA model
from sklearn.decomposition import LatentDirichletAllocation
lda_bag_of_words_creator = LatentDirichletAllocation(learning_method='online', n_components=10)
lda_bag_of_words = lda_bag_of_words_creator.fit_transform(bag_of_words)

# creating the tf-idf LDA model
lda_tfidf_creator = LatentDirichletAllocation(learning_method='online', n_components=10)
lda_tfidf = lda_tfidf_creator.fit_transform(tfidf)

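print("~~~ Topics found by bag of words LDA ~~~")
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
bag_of_words_features = bag_of_words_creator.get_feature_names_out()
for topic_id, topic in enumerate(lda_bag_of_words_creator.components_):
  message = "Topic #{}: ".format(topic_id + 1)
  # argsort()[:-5:-1] selects the 4 highest-weighted words, best first
  message += " ".join([bag_of_words_features[i] for i in topic.argsort()[:-5:-1]])
  print(message)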

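print("\n\n~~~ Topics found by tf-idf LDA ~~~")
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
tfidf_features = tfidf_creator.get_feature_names_out()
for topic_id, topic in enumerate(lda_tfidf_creator.components_):
  message = "Topic #{}: ".format(topic_id + 1)
  # argsort()[:-5:-1] selects the 4 highest-weighted words, best first
  message += " ".join([tfidf_features[i] for i in topic.argsort()[:-5:-1]])
  print(message)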

~~~ Topics found by bag of words LDA ~~~
Topic #1: note street yet make
Topic #2: make street find time
Topic #3: find mccarthy leave give
Topic #4: majesty mccarthy case word
Topic #5: time young turner run
Topic #6: photograph room street hand
Topic #7: house street make find
Topic #8: mccarthy son father case
Topic #9: son mccarthy father find
Topic #10: mccarthy father lestrade turner


~~~ Topics found by tf-idf LDA ~~~
Topic #1: one official paper help
Topic #2: pace note pity monday
Topic #3: holmes say upon man
Topic #4: show memory occur equal
Topic #5: lestrade say bit take
Topic #6: side save deadly become
Topic #7: peace youth seize interfere
Topic #8: surgeon lake factor twist
Topic #9: shadow something regent interview
Topic #10: ill lip different sink
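
The slice applied to `argsort()` in the printing loops can be easy to misread: it walks backwards from the end of the ascending sort order and stops before position -5, so it yields the four highest-weighted words, which is why each topic lists four terms. A plain-Python equivalent, with made-up weights and hypothetical names, looks like this:

```python
# Made-up topic weights for illustration
feature_names = ["case", "father", "find", "mccarthy", "son", "street"]
topic_weights = [0.4, 2.5, 1.1, 3.0, 2.8, 0.2]

# Indices sorted by ascending weight, like numpy's argsort()
ascending = sorted(range(len(topic_weights)), key=lambda i: topic_weights[i])

# [:-5:-1] walks back from the last index and stops before position -5
top_indices = ascending[:-5:-1]
top_words = [feature_names[i] for i in top_indices]
print(top_words)
# prints: ['mccarthy', 'son', 'father', 'find']
```

To show five words per topic instead, the slice would be `[:-6:-1]`.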
