#Big Data Finals - Word Co-occurrence in the PubMed Database
##Asutosh Satapathy {asatapat@andrew.cmu.edu}

###Problem Statement
For this project, I will use the PubMed Central (PMC) open access dataset, implement a system that processes the dataset with a user-provided term, and generate a JSON file for use in a D3-driven visualization. The goal of this final is to produce a visualization of the topics that most commonly co-occur with a user-provided term or phrase in the PubMed documents.

###Introduction
The PubMed database contains millions of documents. Searching through all of the data manually is a daunting, practically impossible task. Hence, I will attempt to make the database explorable with minimal effort. The basic idea is: given a word, find the other words that co-occur with it. There are various ways to approach this problem.
The approach I have implemented is as follows: I scan the entire database for the word and determine which documents contain it. Then I select the other words based on their co-occurrence in those documents.
Again, this can be done in various ways. But before I delve into the details, let's first cover some basics of text mining that I learnt during the course of this project. They will help in understanding the concepts underlying the application.
- Tokenization: Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining.
- Stop Words: In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In linguistics, these words are called function words. Words like ’a’, ’an’, ’the’ are examples of stop words. There is no single defined set of stop words. Different applications and research groups use different sets of stop words.
Stop words are generally omitted in the text mining process. Their frequency is very high in any corpus compared to content words. Pointers to some good stop-word lists are available at http://en.wikipedia.org/wiki/Stop_words
- Bag of Words: The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. Analyzing text using only the frequency of its words is called the bag-of-words model.
- TF-IDF: Tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.
![image1](images/tfidf1.png)
![image2](images/tfidf2.png)
- LDA: In natural language processing, Latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics.
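
To make the first four concepts concrete, here is a minimal sketch in plain Python. The two sample sentences, the tiny stop-word set, and the helper names (`tokenize`, `bag_of_words`, `tf_idf`) are invented for illustration and are not part of the project code. The TF-IDF variant used here (term count divided by document length, multiplied by the natural log of the number of documents over the document frequency) is one common choice and may differ from the formula shown in the images above.

```python
import math
import re
from collections import Counter

# Illustrative stop-word set; real applications use a much longer list.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text):
    """Count content words, ignoring word order and stop words."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)


def tf_idf(term, bag, all_bags):
    """TF-IDF of `term` in one document (assumed variant, see text above)."""
    total = sum(bag.values())
    if total == 0 or term not in bag:
        return 0.0
    tf = bag[term] / total                       # relative term frequency
    df = sum(1 for b in all_bags if term in b)   # documents containing term
    return tf * math.log(len(all_bags) / df)


docs = [
    "The protein is a marker of the disease.",
    "The disease is common in the elderly.",
]
bags = [bag_of_words(d) for d in docs]
print(bags[0])
print(tf_idf("protein", bags[0], bags))
print(tf_idf("disease", bags[0], bags))
```

In this sketch a word that appears in every document, such as "disease", gets a score of zero, while "protein", which appears in only one document, gets a positive score. This is how TF-IDF down-weights words that are common across the whole corpus.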

With these fundamentals in place, let's analyze the problem at hand.
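
As a starting point, the approach outlined in the introduction (find the documents that contain the term, then count the other words in them) can be sketched as follows. This is a simplified illustration for single-word terms, not the final implementation: the function name `cooccurring_words`, the sample documents, and the choice to count each word once per matching document are all assumptions.

```python
from collections import Counter


def cooccurring_words(term, documents, stop_words=frozenset(), top_n=5):
    """Return the words that share the most documents with `term`.

    Assumption: each word is counted once per matching document.
    """
    term = term.lower()
    counts = Counter()
    for text in documents:
        words = set(text.lower().split())
        if term not in words:
            continue  # this document does not mention the term
        counts.update(words - {term} - set(stop_words))
    return counts.most_common(top_n)


sample_docs = [
    "insulin regulates glucose",
    "glucose levels and insulin resistance",
    "protein folding study",
]
print(cooccurring_words("insulin", sample_docs, stop_words={"and"}))
```

Here "glucose" ranks first because it appears in both documents that mention "insulin"; the third document is ignored because it does not contain the term.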

###Algorithm
####Pre-processing
First, we have to process the raw data files. I had two options: read the XML-annotated files and extract the data from them, or read the raw text files. I preferred the second option for simplicity. I could not identify an advantage of the XML files over the .txt files, and parsing the XML carries extra computational overhead. I suspect I may have missed some important feature by not using the XML files, but that is work for another day, and the existing work can be improved later on.

Next, we have to walk through the directories recursively and read each file's content. This can be done with the code below:


In [None]:
#
# Imports used by the code in this cell
#
import glob
import logging
import os
import re

#
# This is for logging data
#
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
logger = logging.getLogger('text_mining_logger')
#
# We will be appending all our documents to this array.
#
documents = []

#
# get the class-names from the directory structure
#
directory_names = list(set(glob.glob(os.path.join("docs", "*"))
                           ).difference(set(glob.glob(os.path.join("docs", "*.*")))))
#
# List of string of class names
#
namesClasses = list()

#
# Navigate through the list of directories recursively
#
logger.info("Reading the documents")
for folder in directory_names:
    #
    # Append the string class name for each class
    #
    currentClass = os.path.basename(folder)
    namesClasses.append(currentClass)

    for fileNameDir in os.walk(folder):
        for fileName in fileNameDir[2]:
            #
            # Only read in the text files
            #
            if fileName[-4:] != ".txt":
                continue
            filePath = os.path.join(fileNameDir[0], fileName)
            with open(filePath, 'r') as myfile:
                #
                # Read the file and replace new-line characters with spaces,
                # so that words on adjacent lines are not joined together.
                #
                data = myfile.read().replace('\n', ' ')
                #
                # Remove all special characters.
                #
                new_string = re.sub('[^a-zA-Z0-9]', ' ', data)
            documents.append(new_string)

As you can see, I have also done some basic filtering: I removed all special characters from the documents. While browsing through the files, I noticed that they contain a lot of Unicode special characters that carry no significance for the content. Hence, I filtered these out with a simple regular expression.
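
To illustrate the effect of this filter, here is a small standalone example. The sample string is invented; the pattern is the same one used in the code above.

```python
import re

# Invented sample text containing punctuation and a non-ASCII character.
sample = "TNF-α levels (p<0.05) rose by 20%."

# Replace every character that is not an ASCII letter or digit with a space.
cleaned = re.sub('[^a-zA-Z0-9]', ' ', sample)

print(cleaned.split())
# ['TNF', 'levels', 'p', '0', '05', 'rose', 'by', '20']
```

Note that the filter also splits hyphenated terms and decimal numbers, as with `0.05` here, which is a side effect to keep in mind when interpreting the resulting tokens.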

The next step is to tokenize the words and to remove commonly occurring words such as "a", "an", "is", "are", and "will".