<a href="https://colab.research.google.com/github/kronze1996/Automated-Q-A-System/blob/main/gensim_tfidf_retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## TF-IDF based retrieval using gensim

This notebook defines the **gensim-based document retrieval method based on tf-idf similarity score** (between corpus documents and the query string).

1. Cleanup / preprocess 
2. Define dictionary
3. Transform corpus - Bag of Words
4. Learn tfidf vectors for corpus
5. Sparse matrix indexing for similarity scoring
6. Retrieve top N documents for the given query string

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os, sys

In [None]:
from sklearn.datasets import fetch_20newsgroups
from gensim import corpora
from gensim.parsing import strip_tags, strip_numeric, \
    strip_multiple_whitespaces, stem_text, strip_punctuation, \
    remove_stopwords, preprocess_string
import pprint
import re

In [None]:
# Import modules needed for this project
!pip install pdfplumber
import pdfplumber


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
infile= '/content/drive/MyDrive/AlmaBetter/Cohort Aravali/Module 8/Q A System Building/Automated Q_A PDFs/Applied Data Science.pdf'
pgList = []
with pdfplumber.open(infile) as pdf:
    # treat each PDF page as one document
    for page in pdf.pages:
        # extract_text() can return None for a page without extractable text;
        # fall back to an empty string so later string filters do not fail
        pgList.append(page.extract_text() or '')

In [None]:
# collect all text documents as list
text_docs = pgList

In [None]:
text_docs[0]

'Applied Data Science\nIan Langmore Daniel Krasner'

### Preprocess the text corpus

In [None]:
# preprocess using gensim.parsing
# ref: https://www.kaggle.com/venkatkrishnan/gensim-text-mining-techniques
transform_to_lower = lambda s: s.lower()

# replace with a single space so the neighbouring words are not merged
remove_single_char = lambda s: re.sub(r'\s+\w{1}\s+', ' ', s)

# Filters to be executed in pipeline
CLEAN_FILTERS = [strip_tags,
                strip_numeric,
                strip_punctuation, 
                strip_multiple_whitespaces, 
                transform_to_lower,
                remove_stopwords,
                remove_single_char]

# Filters out all the irrelevant text elements
def cleaning_pipe(document):
    # Invoking gensim.parsing.preprocess_string method with set of filters
    processed_words = preprocess_string(document, CLEAN_FILTERS)
    
    return processed_words
print(cleaning_pipe(text_docs[0]))

['applied', 'data', 'science', 'ian', 'langmore', 'daniel', 'krasner']
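
One detail in a single-character filter such as `remove_single_char` is the replacement string. The pattern consumes the whitespace on both sides of the lone character, so substituting an empty string glues the neighbouring words together, while substituting a single space keeps them as separate tokens. A small check on a made-up sentence, using only `re`:

```python
import re

PATTERN = r'\s+\w{1}\s+'            # a lone word character surrounded by whitespace
sample = "vector x times matrix"    # made-up example text

glued = re.sub(PATTERN, '', sample)     # whitespace on both sides is removed too
spaced = re.sub(PATTERN, ' ', sample)   # one space is put back between neighbours

print(glued)
print(spaced)
```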


### Define corpus dictionary

In [None]:
def create_dictionary(docs):
    'create dictionary of words in preprocessed corpus'
    pdocs = [cleaning_pipe(doc) for doc in docs]
    dictionary = corpora.Dictionary(pdocs)
    dictionary.save('newsgroup.dict')
    return dictionary,pdocs

In [None]:
dictionary, pdocs = create_dictionary(text_docs)

In [None]:
len(dictionary)

4835

- The dictionary holds 4,835 unique tokens, so every document becomes a 4,835-dimensional sparse vector, which gensim handles efficiently.

### Transform any sample document as per the known dictionary

In [None]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(cleaning_pipe(new_doc))
print(new_vec)

[(460, 1), (898, 1)]


### Transform complete corpus as BoW

In [None]:
bow_corpus = [dictionary.doc2bow(text) for text in pdocs]

### Fit the tfidf model a.k.a tfidf vectorizer

In [None]:
from gensim import models

# train the model
tfidf = models.TfidfModel(bow_corpus)

In [None]:
# transform any new document as tfidf vector
words = cleaning_pipe("want to sell bike")
print(tfidf[dictionary.doc2bow(words)])

[(721, 1.0)]
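
A lone weight of exactly 1.0 is what unit-length normalization produces when only one query term survives. With gensim's default settings, `TfidfModel` weights each term by raw term frequency times log2(N/df), drops terms whose weight is zero, and L2-normalizes the result. The sketch below reimplements that scheme in plain Python under those assumptions; `tfidf_vector` and the document-frequency table are made up for illustration and are not part of gensim.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, num_docs):
    """Hypothetical helper: tf * log2(N/df), then scale to unit length."""
    counts = Counter(t for t in tokens if t in doc_freq)   # unknown tokens vanish
    weights = {t: tf * math.log2(num_docs / doc_freq[t])
               for t, tf in counts.items()}
    weights = {t: w for t, w in weights.items() if w > 0}  # drop zero-idf terms
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Made-up document frequencies for a 4-document corpus
doc_freq = {"bike": 1, "data": 2}

single = tfidf_vector(["want", "sell", "bike"], doc_freq, num_docs=4)
double = tfidf_vector(["bike", "data"], doc_freq, num_docs=4)
print(single)
print(double)
```

With a single known term the vector has one component, and scaling it to unit length always gives 1.0 regardless of how rare the term is.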


### Sparse matrix indexing for similarity scoring

In [None]:
# index the tfidf vector of corpus as sparse matrix
from gensim import similarities
index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))
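
`SparseMatrixSimilarity` scores a query against every indexed document by cosine similarity. To make the score itself concrete, here is a plain-Python sketch over sparse vectors stored as `{term_id: weight}` dicts; `cosine` is a hypothetical helper for illustration, not a gensim function, and the vectors are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {id: weight} dicts."""
    dot = sum(w * v.get(i, 0.0) for i, w in u.items())   # only shared ids contribute
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0                                        # empty vector: no similarity
    return dot / (norm_u * norm_v)

query = {0: 1.0}
doc_a = {0: 3.0, 1: 4.0}   # shares term 0 with the query
doc_b = {2: 1.0}           # shares nothing with the query

print(cosine(query, doc_a))
print(cosine(query, doc_b))
```

Documents that share no terms with the query score 0, which is why a query made only of unknown words cannot rank anything meaningfully.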

### Retrieve top N documents for the given query string

In [None]:
def get_closest_n(query, n):
    '''get the top matching docs as per cosine similarity
    between tfidf vector of query and all docs'''
    query_document = cleaning_pipe(query)
    query_bow = dictionary.doc2bow(query_document)
    sims = index[tfidf[query_bow]]
    top_idx = sims.argsort()[-1*n:][::-1]
    return [text_docs[i] for i in top_idx]
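
`argsort` sorts in ascending order, so `get_closest_n` takes the last `n` positions and reverses them. When a query contains no known terms, every score is 0 and the function still returns `n` pages. A possible variant, sketched here with made-up scores and a hypothetical name `top_n_with_scores`, returns the scores alongside the indices and skips zero-similarity documents:

```python
import numpy as np

def top_n_with_scores(sims, n):
    """Hypothetical helper: (index, score) pairs, best first, zeros skipped."""
    order = np.argsort(sims)[::-1][:n]          # indices sorted by descending score
    return [(int(i), float(sims[i])) for i in order if sims[i] > 0]

# Made-up similarity scores for a 4-document index
sims = np.array([0.1, 0.0, 0.7, 0.3], dtype=np.float32)
ranked = top_n_with_scores(sims, 2)
no_match = top_n_with_scores(np.zeros(4, dtype=np.float32), 2)

print(ranked)
print(no_match)
```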

In [None]:
for d in get_closest_n("What is Data Science",2):
    print(d)

Applied Data Science
Ian Langmore Daniel Krasner
CONTENTS v
What is data science? With the major technological advances of the last
two decades, coupled in part with the internet explosion, a new breed of
analysist has emerged. The exact role, background, and skill-set, of a data
scientist are still in the process of being deﬁned and it is likely that by the
time you read this some of what we say will seem archaic.
In very general terms, we view a data scientist as an individual who uses
current computational techniques to analyze data. Now you might make
the observation that there is nothing particularly novel in this, and subse-
quentyaskwhathasforcedthedeﬁnition.1 Afterallstatisticians,physicists,
biologisitcs, ﬁnance quants, etc have been looking at data since their respec-
tive ﬁelds emerged. One short answer comes from the fact that the data
sphere has changed and, hence, a new set of skills is required to navigate it
eﬀectively. The exponential increase in computational power ha

In [None]:
import pickle 

In [None]:
with open('index.pkl', 'wb') as p:
    # serialize class object
    pickle.dump(index, p)

In [None]:
with open('tfidf.pkl', 'wb') as p:
    # serialize class object
    pickle.dump(tfidf, p)
     

In [None]:
with open('bow_corpus.pkl', 'wb') as p:
    # serialize class object
    pickle.dump(bow_corpus, p)
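
To reuse these artifacts in a later session they can be read back with `pickle.load`. A minimal sketch: `load_pickle` is a hypothetical helper, and the round trip below uses a small stand-in object rather than the real index. The dictionary was written separately with `dictionary.save('newsgroup.dict')`, so it would be restored with `corpora.Dictionary.load('newsgroup.dict')` instead.

```python
import os
import pickle
import tempfile

def load_pickle(path):
    """Read back one object written with pickle.dump (trusted files only)."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Round-trip check with a stand-in shaped like a tiny BoW corpus; in the
# notebook the paths would be 'index.pkl', 'tfidf.pkl' and 'bow_corpus.pkl'.
sample = [[(0, 1), (3, 2)], [(1, 1)]]
path = os.path.join(tempfile.mkdtemp(), 'sample.pkl')
with open(path, 'wb') as f:
    pickle.dump(sample, f)

restored = load_pickle(path)
print(restored == sample)
```

Unpickling executes code embedded in the file, so this should only be used on files you created yourself.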
     