## Gensim text retrieval semantic engine

* Gensim text retrieval semantic engine with Latent Semantic Indexing (LSA in TR).
* Dataset is https://www.kaggle.com/rmisra/news-category-dataset with 202372 entries. 

In [1]:
import json
import gensim

In [2]:
DATA_PATH = "../data/News_Category_Dataset_v2.json"
DATA_LEN = 202372

### Corpus preprocessing

In [3]:
from gensim.parsing.preprocessing import STOPWORDS

def tokenize(text):
    return [token for token in gensim.utils.simple_preprocess(text) if token not in STOPWORDS]

def iter_news(file):
    for line in open(file):
        line = json.loads(line)['headline'] + json.loads(line)['short_description']
        tokens = tokenize(line)
        yield line, tokens

In [4]:
from gensim.corpora import Dictionary
import json

class NewsCorpus(object):
    
    def __init__(self, path):
        self.path = path
        self.dictionary = Dictionary()
    
    def __iter__(self):
        for line in open(self.path):
            
            line = json.loads(line)['headline'] + json.loads(line)['short_description']
            tokens = tokenize(line)
            # yield self.dictionary.doc2bow(line['headline'].lower().split())
            # yield self.dictionary.doc2bow(json.loads(line)['headline'].lower().split())
            yield line, tokens
            

In [5]:
# stream just tokens
doc_stream = (tokens for _, tokens in iter_news(DATA_PATH))

# build dict
%time id2word_news = gensim.corpora.Dictionary(doc_stream)
print(id2word_news)

Wall time: 16.6 s
Dictionary(168877 unique tokens: ['america', 'children', 'day', 'husband', 'killed']...)


In [6]:
# ignore words that appear in less than 20 documents or more than 10% documents
id2word_news.filter_extremes(no_below=20, no_above=0.1)
print(id2word_news)

Dictionary(14366 unique tokens: ['america', 'children', 'day', 'husband', 'killed']...)


In [7]:
class NewsCorpus():
    
    def __init__(self, file, dictionary):
        self.file = file
        self.dict = dictionary
        
    def __iter__(self):
        self.titles = []
        for title, tokens in iter_news(self.file):
            self.titles.append(title)
            yield self.dict.doc2bow(tokens)
        
            
# create a stream of bag-of-words vectors
news_corpus = NewsCorpus(DATA_PATH, id2word_news)
vector = next(iter(news_corpus))

In [8]:
# store corpus
%time gensim.corpora.MmCorpus.serialize('../data/news_bow.mm', news_corpus)

# store dictionary
id2word_news.save('../data/news.dict')

Wall time: 18.7 s


In [9]:
# load dictionary
id2word_news = gensim.corpora.Dictionary.load('../data/news.dict')

# load corpus
mm_corpus = gensim.corpora.MmCorpus('../data/news_bow.mm')
print(mm_corpus)

MmCorpus(200853 documents, 14366 features, 2548042 non-zero entries)


In [10]:
%time lsi = gensim.models.lsimodel.LsiModel(corpus=mm_corpus, id2word=id2word_news)

Wall time: 46.6 s


In [11]:
lsi.save('../data/lsi_news.model')

In [12]:
lsi = lsi.load('../data/lsi_news.model')

In [13]:
# build the index
from gensim import similarities
%time index = similarities.Similarity('../data/', lsi[mm_corpus], num_features = lsi.num_topics)
#%time index = similarities.Similarity(lsi[mm_corpus])
index.save('../data/lsi_news.index')

Wall time: 37.7 s


## Search example

In [14]:
import gensim

lsi = gensim.models.lsimodel.LsiModel.load('../data/lsi_news.model')
index = gensim.similarities.Similarity.load('../data/lsi_news.index')
dictionary = gensim.corpora.Dictionary.load('../data/news.dict')

In [15]:
# transform doc into lsi vector space (we need the model for this)
doc = "disaster"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]

# query doc (we need the index for this)
sims = index[vec_lsi]

# sort by similarity
sims = sorted(enumerate(sims), key=lambda item: -item[1])

In [16]:
# print most similar docs
print(sims[:10])

[(19424, 0.7097067), (17114, 0.7094732), (17451, 0.68108934), (78505, 0.6792595), (45511, 0.6750591), (46051, 0.6609052), (50149, 0.6568615), (175384, 0.6479787), (17197, 0.647551), (78660, 0.64356035)]


In [17]:
# get documents. gensim leaves the metadata of the corpus element to the user.
# this is helpful for database access of the document, but for text files we need
# to have a dict in memory or just go line by line of the file.

import json
ids = [i[0] for i in sims[:10]]

fp = open(DATA_PATH)
for i, line in enumerate(fp):
    if i in ids:
        line = json.loads(line)
        line['headline']
        line['short_description']
        print("{}\n\t{}".format(line['headline'], line['short_description']))
        
    if i > max(ids):
        break

A Struggling Haiti Scrambles To Prepare For Hurricane Irma
	Barreling through the Caribbean, the “extremely dangerous” core of Irma was predicted to strike northern Haiti.
Hail The Unsung Heroes Of Hurricane Harvey
	I am frustrated by the lack of understanding that undermines the efforts of many dedicated staff and volunteers in the nonprofit
Irma Intensifies Into Category 3 Hurricane
	The hurricane is over the eastern Atlantic and headed toward the Caribbean.
Chronicling A Forgotten Disaster: Hurricane Matthew, 10 Months Later
	Matthew was destined to be a forgotten disaster, in the shadow of Haiti’s 2010 earthquake and overshadowed by the U.S. elections.
U.S. Suspends Deportations Of Haitians After Hurricane Matthew
	A brief reprieve.
'Evil Sodomites' Now Being Blamed For Hurricane Matthew
	🙃
Hurricane Katrina Survivors Relive Familiar Nightmare In Baton Rouge
	The recent flooding in southern Louisiana has brought back horrific memories for some residents.
Fate Of Cargo Ship Unknown 