# Pre-processing

### Document to vector using Word2Vec
A pretrained Word2Vec model, loaded with gensim, converts each word into a vector. The vector for a document is the average of the Word2Vec vectors of all words that appear both in the document and in the Word2Vec vocabulary. Each document vector has N = 300 dimensions. In this representation, semantically similar documents remain close to each other in the N-dimensional space.
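
The averaging idea can be sketched on toy data. The example below is only an illustration: the 3-dimensional vectors are made up (the real embeddings have 300 dimensions), and `toy_vectors`, `average_vector`, and `cosine` are hypothetical names that do not appear elsewhere in this notebook.

```python
import numpy as np

# Made-up 3-dimensional "embeddings", standing in for 300-dimensional Word2Vec vectors
toy_vectors = {
    "markov": np.array([0.9, 0.1, 0.0]),
    "chain": np.array([0.8, 0.2, 0.0]),
    "graph": np.array([0.7, 0.3, 0.1]),
    "volcano": np.array([0.0, 0.1, 0.9]),
}

def average_vector(tokens, vectors):
    """Average the vectors of the tokens that are in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token of the document is in the vocabulary
    return np.mean(known, axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = average_vector("markov chain".split(), toy_vectors)
doc_b = average_vector("markov graph unknownword".split(), toy_vectors)
doc_c = average_vector(["volcano"], toy_vectors)

# Documents sharing related words end up closer than unrelated ones
print(cosine(doc_a, doc_b))
print(cosine(doc_a, doc_c))
```

Out-of-vocabulary tokens such as `unknownword` are simply ignored, which mirrors the vocabulary check in the class below.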

In [1]:
# Document to Vector

import gensim.models.keyedvectors as word2vec
import numpy as np

class DocumentToVector:

    def __init__(self):
        filename = 'GoogleNews-vectors-negative300.bin.gz'
        self.__word2vec_model = word2vec.KeyedVectors.load_word2vec_format(filename, binary=True)
        # Normalize vectors to unit length (init_sims is the gensim 3.x API and is deprecated in gensim 4)
        self.__word2vec_model.init_sims(replace=True)

    def getWord2Vec(self, word):
        return self.__word2vec_model[word]


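    def getDoc2Vec(self, doc):
        # Split the text on whitespace first: iterating over the raw string
        # would yield single characters instead of words.
        # (.vocab is the gensim 3.x attribute; gensim 4 uses key_to_index.)
        words = [word for word in doc.split() if word in self.__word2vec_model.vocab]
        # Average the word vectors if at least one word is in the vocabulary
        if words:
            return np.mean(self.__word2vec_model[words], axis=0)
        # Length-1 sentinel: no word of the document is in the vocabulary
        return np.array([0])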
        

In [2]:
doc2vec = DocumentToVector()

# Check just for one file

In [3]:
doc = '''
Markov chain Monte Carlo (MCMC) methods are an important algorithmic
 device in a variety of fields.  This project studies techniques for rigorous
 analysis of the convergence properties of Markov chains.   The emphasis is on
 refining probabilistic, analytic and combinatorial tools (such as coupling,
 log-Sobolev, and canonical paths) to improve existing algorithms and develop
 efficient algorithms for important open problems.
 Problems arising in
 computer science, discrete mathematics, and physics are of particular interest,
 e.g., generating random colorings and independent sets of bounded-degree
 graphs, approximating the permanent, estimating the volume of a convex body,
 and sampling contingency tables.  The project also studies inherent connections
 between phase transitions in statistical physics models and convergence
 properties of associated Markov chains.
 The investigator is developing a
 new graduate course on MCMC methods.
'''
import pandas as pd
doc_vec = doc2vec.getDoc2Vec(doc)
print(f'Object type: {type(doc_vec)}')
print(f'Doc vector length: {len(doc_vec)}')
print(f'Data type: {doc_vec.dtype}')

Object type: <class 'numpy.ndarray'>
Doc vector length: 300
Data type: float32


# Pre-Process all the files in directory tree

The unique identifier 'AwardNo' and the abstract are extracted from every file of the NSF grant dataset, Part 1.

In [4]:
class AbstractExtrator:

    def __init__(self):
        pass

    def extractAbstract(self, filepath):
        line_index = 0
        # Default to 'null' so that a file without an 'Award Number:' line
        # is skipped by the caller instead of raising an UnboundLocalError
        awardNo = 'null'
        with open(filepath, encoding='iso-8859-1') as f:
            content = f.readlines()
            for line in content:
                line_index += 1
                if line.find('Award Number:') != -1:
                    awardNo = line.split(':')[1].strip()
                if line.find('Abstract    :') != -1:
                    # line_index now points at the first line after the marker
                    break

        # Everything after the 'Abstract    :' line is the abstract text
        abstract = ' '.join(content[line_index:]).strip()
        return abstract, awardNo
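
To see what this kind of extraction returns, here is a minimal sketch run on a synthetic record. The record layout is an assumption modelled only on the two markers searched for above ('Award Number:' and 'Abstract    :'). The award number, the `Title` field, and the standalone `extract` helper are made up for illustration.

```python
import os
import tempfile

# Synthetic record: only the two marker strings come from the notebook;
# the other field and all values are invented for this demo.
sample = (
    "Title       : Example award\n"
    "Award Number: 1234567\n"
    "Abstract    :\n"
    "    First line of the abstract.\n"
    "    Second line of the abstract.\n"
)

def extract(path):
    """Return (abstract, award_no) from a record in the assumed layout."""
    award_no = None
    abstract_start = None
    with open(path, encoding="iso-8859-1") as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        if "Award Number:" in line:
            award_no = line.split(":", 1)[1].strip()
        if "Abstract    :" in line:
            abstract_start = i + 1  # abstract begins on the next line
            break
    if abstract_start is None:
        return "", award_no
    return " ".join(lines[abstract_start:]).strip(), award_no

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="iso-8859-1") as f:
        f.write(sample)
    abstract, award_no = extract(path)

print(award_no)
print(abstract)
```

Any text on the same line as the 'Abstract    :' marker is not captured, which matches the behaviour of the class above.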

Documents whose abstract is empty or shorter than 20 characters (for example 'Not Available') are filtered out.

In [5]:
def docs_filter_(doc):
    # Placeholder abstracts such as 'no', 'Not Available' (13 characters) or
    # '*****' are all shorter than 20 characters, so a single length check
    # covers them as well as empty abstracts.
    if len(doc) < 20:
        return None

    return doc
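
A quick way to check the effect of the length threshold is to run it on a few representative strings. The sketch below uses a hypothetical boolean variant, `keep_abstract`, with an assumed `min_chars` parameter that mirrors the 20-character cutoff. The sample strings are invented.

```python
def keep_abstract(doc, min_chars=20):
    """Return True when the abstract is long enough to be worth embedding."""
    return len(doc.strip()) >= min_chars

samples = [
    "",
    "Not Available",
    "*****",
    "This project studies Markov chain convergence.",
]

# Only the last sample is long enough to pass the filter
for s in samples:
    print(repr(s), keep_abstract(s))
```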

In [6]:
import os
from fnmatch import fnmatch
pattern = "*.txt"

def list_files(dir):
    r = []
    for root, dirs, files in os.walk(dir):
        for name in files:
            if fnmatch(name, pattern):
                r.append(os.path.join(root, name))
    return r

files = list_files('./Part1/')
#print(files)

docs = []
docs_discarded = []
docs_vector = []

no_files = 0
no_skipped_files = 0
for file in files:
    #print(file)
    # Extract Abstract
    absExt = AbstractExtrator()
    doc,award = absExt.extractAbstract(file)
    #print('Award: ', award)
    if award == 'null':
        #print('Skipped due to null Award')
        docs_discarded.append([award,doc])
        continue
    
    # Skip blank or very short abstracts
    if docs_filter_(doc) is None:
        #print('Skipped due to blank Abstract')
        no_skipped_files += 1
        docs_discarded.append([award,doc])
        continue
    
    docs.append([award,doc])
    #Determine vector representation of the document
    vec = doc2vec.getDoc2Vec(doc)
    
    if(len(vec) != 1):
        vec = np.insert(vec, 0, award)
        #print(vec)
        docs_vector.append(vec)
        no_files += 1
        
    else:
       # print('Skipped due to poor vocab')
        no_skipped_files += 1
        docs_discarded.append([award,doc])
        continue
    
    #break
    
print(f'No of valid files = {no_files}')
print(f'No of skipped files scanned = {no_skipped_files}')

No of valid files = 49074
No of skipped files scanned = 2685
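
The recursive file listing in the cell above could also be written with `pathlib`. This is an alternative sketch, not the notebook's own code: `list_txt_files` is a hypothetical name and the demo directory and file names are made up.

```python
import tempfile
from pathlib import Path

def list_txt_files(root):
    """Return all *.txt files under root, sorted so the order is reproducible."""
    return sorted(str(p) for p in Path(root).rglob("*.txt") if p.is_file())

# Demo on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "awards_demo" / "awd_demo_01"
    nested.mkdir(parents=True)
    (nested / "a0000001.txt").write_text("demo")
    (nested / "notes.md").write_text("ignored")
    found = list_txt_files(tmp)

print(len(found))
```

Sorting the paths makes repeated runs process the files in the same order, which helps when comparing the output files between runs.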


Number of documents chosen for processing: 49074

In [7]:
docs_vector = np.array(docs_vector)

# Write the document vectors into files for easy post-processing

'docs_vector.csv' is a pre-processed file containing 49074 documents, each a 300-dimensional vector. The index is the award number.

In [8]:
principalDf = pd.DataFrame(docs_vector)
principalDf.set_index(principalDf.columns[0],inplace=True)

In [9]:
principalDf.to_csv('docs_vector.csv',header=None)
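
For post-processing, a header-less CSV of this shape can be read back with pandas. The sketch below is a self-contained round trip on stand-in data: the award numbers, the 4-dimensional vectors, and the file name `docs_vector_demo.csv` are made up. It also keeps the award numbers as strings, which is an assumption about what is convenient downstream rather than what the notebook does.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in data: two invented award numbers with 4-dimensional vectors
awards = ["9113226", "0000042"]
vectors = np.array([[0.1, 0.2, 0.3, 0.4],
                    [0.5, 0.6, 0.7, 0.8]], dtype=np.float32)

df = pd.DataFrame(vectors, index=pd.Index(awards, name="award"))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "docs_vector_demo.csv")
    df.to_csv(path, header=False)
    # Read column 0 as text so leading zeros survive, then use it as the index
    loaded = pd.read_csv(path, header=None, dtype={0: str}).set_index(0)

print(loaded.shape)
print(list(loaded.index))
```

Reading the first column as text avoids award numbers being parsed as numbers, which would drop leading zeros.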

# Test the processed document


The award number and abstract of each selected file are written to 'docs.txt'.

In [10]:
# Write the kept documents to a text file to check that they were extracted properly
with open("docs.txt","w") as f:
    for item in docs:
        #if(docs_filter_(doc) != None):
        f.write(item[0])
        f.write(": \n")        
        f.write(item[1])
        f.write("\n @@@@@*****@@@@@ \n")        

The award number and abstract of each discarded file are written to 'docs_discarded.txt'.

In [11]:
# Write the discarded documents to a text file to check what was filtered out
with open("docs_discarded.txt","w") as f:
    for item in docs_discarded:
        #if(docs_filter_(doc) != None):
        f.write(item[0])
        f.write(": \n")        
        f.write(item[1])
        f.write("\n @@@@@*****@@@@@ \n")        

In [12]:
# Testing random abstracts
absExt1 = AbstractExtrator()
test_abstract, test_awardNo = absExt1.extractAbstract('./Part1/awards_1991/awd_1991_13/a9113226.txt')
print('Test Abstract: ', test_abstract)
print('Test awardNo: ', test_awardNo)

Test Abstract:  Abstractions are the basis for most of the ways that programmers look          
               at their programs, data, and execution.  Most of these abstractions            
               have a visual component.  This work envisions a system that will allow         
               programmers to rapidly define abstractions and their accompanying              
               visualizations.  Such a system would have immediate applications to            
               programming environments, data bases, parallel programming, and                
               education.  This project will investigate what it means to design such         
               a system and to explore many of the underlying problems that must be           
               solved before such a system can become a reality.  In conjunction with         
               other research projects, it will lead to an abstraction-based                  
               visualization environment to exper