# BOW and TF-IDF with gensim

What are the most common terms per document?

In [9]:
text = """
Abilify 30 mg tablet	1 (one) tablet by mouth daily
Accu-Chek Aviva Plus test strips	4 (four) test strips miscellaneous daily as needed for E11.65
Accu-Chek FastClix	1 (one) lancet miscellaneous three times a day
Accu-Chek Multiclix Lancet	4 (four) lancets miscellaneous daily for ninety day(s) as needed for E11.65
Accu-Chek Nano	1 (one) meter kit miscellaneous one time only
Accu-Chek SmartView Test Strips	3 (three) test strips miscellaneous daily
acetaminophen 300 mg-codeine 30 mg tablet (Also Known As Tylenol-Codeine #3)	1 (one) tablet by mouth three times a day as needed for pain
acetaminophen 325 mg tablet	1 (one) tablet by mouth every 4 hours for ten days as needed for pain
acetaminophen 325 mg tablet	2 (two) tablets by mouth every 8 hours for fourteen days
acetaminophen 500 mg tablet	2 (two) tablets by mouth every 8 hours for seven days
acetaminophen 500 mg tablet	2 (two) tablets by mouth every 8 hours for two days as needed for pain
acetazolamide 250 mg tablet	1 (one) tablet by mouth three times a day
acyclovir 400 mg tablet (Also Known As Zovirax)	1 (one) tablet by mouth daily
acyclovir 400 mg tablet (Also Known As Zovirax)	1 (one) tablet by mouth twice a day
acyclovir 400 mg tablet (Also Known As Zovirax)	1 (one) tablet by mouth twice a day for fourteen days
"""

In [10]:
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize

In [24]:
# Note: text starts and ends with a newline, so the first and last entries are empty strings
documents = text.split('\n')

In [25]:
tokenized = [word_tokenize(doc.lower()) for doc in documents]

In [26]:
tokenized[:2]

[[],
 ['abilify',
  '30',
  'mg',
  'tablet',
  '1',
  '(',
  'one',
  ')',
  'tablet',
  'by',
  'mouth',
  'daily']]
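The output above shows two kinds of noise: an empty first document and punctuation tokens such as `(` and `)`. A minimal sketch of one way to drop both is shown here. It assumes a hypothetical regex-based `simple_tokenize` as a stand-in for `word_tokenize` and a two-line `sample` string; neither is part of the original notebook, and the later cells do not depend on it.

```python
import re

def simple_tokenize(line):
    # Hypothetical stand-in for nltk's word_tokenize: lowercase the line, then
    # keep runs of letters/digits (allowing internal hyphens or dots).
    # Punctuation such as "(" and ")" is dropped.
    return re.findall(r"[a-z0-9]+(?:[-.][a-z0-9]+)*", line.lower())

sample = """
Abilify 30 mg tablet\t1 (one) tablet by mouth daily
Accu-Chek FastClix\t1 (one) lancet miscellaneous three times a day
"""

# Skip blank lines so the corpus contains no empty documents.
docs = [line for line in sample.split("\n") if line.strip()]
tokens = [simple_tokenize(doc) for doc in docs]
print(tokens[0])
```

Whether to strip punctuation depends on the task; for term counts it mostly removes tokens that would otherwise dominate the frequency tables.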

## Create a gensim dictionary

In [27]:
# Note: this name shadows Python's built-in dict; the cells below rely on it
dict = Dictionary(tokenized)

What is the token id in the dictionary for "acetaminophen"? (Tokens were lowercased before the dictionary was built.)

In [29]:
dict.token2id.get("acetaminophen")

42
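To make the idea of a token-to-id mapping concrete, here is a rough pure-Python sketch. The helper `build_token2id` and the toy documents are hypothetical. Assigning ids to each document's new tokens in sorted order is an assumption based on the ids visible in this notebook's output, not a statement about gensim's internals.

```python
def build_token2id(tokenized_docs):
    # Map each distinct token to an integer id.
    token2id = {}
    for doc in tokenized_docs:
        # New tokens from each document get the next free ids, in sorted order.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

toy_docs = [
    ["abilify", "30", "mg", "tablet", "tablet"],
    ["acetaminophen", "325", "mg", "tablet"],
]
toy_token2id = build_token2id(toy_docs)

print(toy_token2id.get("acetaminophen"))  # id of a known token
print(toy_token2id.get("ibuprofen"))      # None for a token never seen
```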

## Create a gensim corpus

In [30]:
corpus = [dict.doc2bow(doc) for doc in tokenized]

In [34]:
corpus[1]

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 1),
 (10, 2)]
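Each entry in the output above is a `(token_id, count)` pair. The same representation can be sketched with `collections.Counter`. The helper `to_bow` and the toy mapping are hypothetical illustrations rather than gensim code; tokens missing from the mapping are simply skipped.

```python
from collections import Counter

def to_bow(doc, token2id):
    # Count only tokens that have an id, then emit (id, count) pairs sorted by id.
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[token], n) for token, n in counts.items())

toy_token2id = {"30": 0, "abilify": 1, "mg": 2, "tablet": 3}
toy_doc = ["abilify", "30", "mg", "tablet", "tablet", "unknown"]

toy_bow = to_bow(toy_doc, toy_token2id)
print(toy_bow)  # [(0, 1), (1, 1), (2, 1), (3, 2)]
```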

## Bag-of-words with `gensim`

Sort the 5th entry in our corpus (`corpus[4]`, the Accu-Chek Multiclix Lancet line) by word frequency, most frequent first

In [40]:
bow = sorted(corpus[4], key=lambda w: w[1], reverse=True)
bow

[(0, 2),
 (1, 2),
 (16, 2),
 (6, 1),
 (11, 1),
 (12, 1),
 (13, 1),
 (15, 1),
 (17, 1),
 (18, 1),
 (19, 1),
 (24, 1),
 (26, 1),
 (29, 1),
 (30, 1),
 (31, 1),
 (32, 1)]

Print the top 6 words of the 5th document with the count of each word

In [41]:
for word_id, word_count in bow[:6]:
    print(dict.get(word_id), word_count)

( 2
) 2
for 2
daily 1
4 1
accu-chek 1


How many times does each word appear in our *entire* corpus?

In [49]:
import itertools
from collections import defaultdict

total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count

In [46]:
# Sort total_word_count with the most common words first
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True)

In [50]:
for word_id, word_count in sorted_word_count[:6]:
    print(dict.get(word_id), word_count)

( 20
) 20
tablet 17
for 11
by 10
mg 10
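The same corpus-wide tally can also be written with `collections.Counter`, whose `most_common` method replaces the explicit sort. This is a small sketch on a made-up `toy_corpus` of `(word_id, count)` pairs, not the corpus built above.

```python
from collections import Counter
from itertools import chain

# Toy corpus: each document is a list of (word_id, count) pairs; one is empty.
toy_corpus = [[(0, 1), (1, 2)], [(1, 1), (2, 2)], []]

totals = Counter()
for word_id, n in chain.from_iterable(toy_corpus):
    totals[word_id] += n

print(totals.most_common(2))  # [(1, 3), (2, 2)]
```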


## TF-IDF

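TF-IDF scores how important a word is to a document relative to the rest of the corpus. Words that appear in many documents, such as "mg" or "by", are downweighted, while words concentrated in only a few documents score higher.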

In [51]:
from gensim.models.tfidfmodel import TfidfModel

tfidf = TfidfModel(corpus)

tfidf[corpus[1]]

[(0, 0.030431918490774385),
 (1, 0.030431918490774385),
 (2, 0.15463304814088327),
 (3, 0.5203314451468743),
 (4, 0.6888618767345641),
 (5, 0.12901590119992506),
 (6, 0.2975463327876148),
 (7, 0.12901590119992506),
 (8, 0.12901590119992506),
 (9, 0.15463304814088327),
 (10, 0.25803180239985013)]

In [54]:
for word_id, score in tfidf[corpus[1]]:
    print(dict.get(word_id), score)

( 0.030431918490774385
) 0.030431918490774385
1 0.15463304814088327
30 0.5203314451468743
abilify 0.6888618767345641
by 0.12901590119992506
daily 0.2975463327876148
mg 0.12901590119992506
mouth 0.12901590119992506
one 0.15463304814088327
tablet 0.25803180239985013


As we would expect, "abilify", which appears in only one document in the corpus, has the highest TF-IDF score in this prescription.
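To see where these numbers come from, the scores for the Abilify document can be recomputed by hand. The sketch below assumes the weighting is term count times log2(N / document frequency), followed by scaling the vector to unit length, which is my understanding of `TfidfModel`'s defaults. The document frequencies were counted manually from the text above, and N = 17 includes the two empty strings produced by `split('\n')`.

```python
import math

num_docs = 17  # 15 prescription lines plus 2 empty strings from split('\n')

# term: (count in the Abilify document, number of documents containing the term)
# Document frequencies were counted by hand; they are an assumption of this
# sketch, not output from gensim.
terms = {
    "(": (1, 15), ")": (1, 15), "1": (1, 9), "30": (1, 2),
    "abilify": (1, 1), "by": (1, 10), "daily": (1, 5), "mg": (1, 10),
    "mouth": (1, 10), "one": (1, 9), "tablet": (2, 10),
}

# Raw weight: term frequency times log2 inverse document frequency.
raw = {t: tf * math.log2(num_docs / df) for t, (tf, df) in terms.items()}

# Scale the vector to unit (L2) length.
norm = math.sqrt(sum(w * w for w in raw.values()))
scores = {t: w / norm for t, w in raw.items()}

for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(term, round(score, 4))
```

Under these assumptions "abilify" comes out at roughly 0.689 and "30" at roughly 0.520, in line with the output above. The parentheses appear in every non-empty document, so their weight is close to zero.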