## LLDA TOPIC MODELING


## Using Scrapy to Get Data from NTRS

Scrapy is a solid web crawling framework. We have finished the crawler and have successfully crawled almost 6000 papers from NTRS (NASA Technical Reports Server). Within this framework, we can easily handle requests and add a delay between them so that we do not put too much pressure on the server.
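As a rough illustration of the delay, a Scrapy project's `settings.py` can throttle requests with the built-in `DOWNLOAD_DELAY` and AutoThrottle settings. The values below are assumptions chosen for the example, not the ones used for our crawl.

```python
# settings.py (sketch): all numbers are illustrative assumptions
DOWNLOAD_DELAY = 2.0                 # base wait, in seconds, between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True      # jitter the wait between 0.5x and 1.5x of the base
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # send one request at a time to the server
AUTOTHROTTLE_ENABLED = True          # let Scrapy adapt the delay to server latency
AUTOTHROTTLE_START_DELAY = 2.0       # initial delay used by AutoThrottle
AUTOTHROTTLE_MAX_DELAY = 30.0        # upper bound on the adapted delay
```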

To crawl papers from NASA with Scrapy, we first need to define the item, which is the data structure holding the fields we want to extract from the website. It is defined as follows.

```python
import scrapy


class PaperCrawlerItem(scrapy.Item):
    # Metadata fields collected for each NTRS paper
    title = scrapy.Field()
    ntrs_full_text = scrapy.Field()
    author_and_affiliation = scrapy.Field()
    abstract = scrapy.Field()
    publication_date = scrapy.Field()
    document_id = scrapy.Field()
    subject_category = scrapy.Field()
    patent_number = scrapy.Field()
    document_type = scrapy.Field()
    meeting_information = scrapy.Field()
    meeting_sponsor = scrapy.Field()
    financial_sponsor = scrapy.Field()
    organization_source = scrapy.Field()
    description = scrapy.Field()
    NASA_terms = scrapy.Field()
    other_descriptors = scrapy.Field()
```

First, we defined the entry URLs in paper_spider.py:

```python
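def start_requests(self):
    # Each results page is addressed by the offset "No", in steps of 10.
    for i in range(102231):
        # scrapy.Request replaces the deprecated make_requests_from_url helper
        yield scrapy.Request(
            "http://ntrs.nasa.gov/search.jsp?Ntx=mode%20matchall&Ntk=All&N=0&No=" + str(i * 10),
            dont_filter=True)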
```
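The same URLs can also be built from named query parameters instead of string concatenation. This is only a sketch: the helper name `search_page_url` and the `page_size` argument are ours, and the parameter values are copied from the URL above.

```python
from urllib.parse import quote, urlencode

SEARCH_URL = "http://ntrs.nasa.gov/search.jsp"


def search_page_url(page, page_size=10):
    """Return the URL of the given zero-based results page."""
    params = {"Ntx": "mode matchall", "Ntk": "All", "N": 0, "No": page * page_size}
    # quote_via=quote encodes the space as %20 rather than "+"
    return SEARCH_URL + "?" + urlencode(params, quote_via=quote)


first_page = search_page_url(0)
fourth_page = search_page_url(3)
```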

Then we carefully analyzed the HTML DOM tree to locate the values we needed and used XPath to extract them. This is a tedious task, because we need to handle every case and the elements are not always in the same format.
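One way to cope with inconsistent markup is to pass every XPath result through a small normalizing helper. The sketch below is an assumption about how this could be done, not our actual extraction code: `clean_field` is a hypothetical name, and it expects the list of strings that a Scrapy selector's `.xpath(...).getall()` returns.

```python
def clean_field(values):
    """Join the text fragments of one field into a single clean string.

    Returns "" when the element is missing, so every item has the same shape.
    """
    # Collapse runs of whitespace (including newlines) inside each fragment
    parts = [" ".join(value.split()) for value in (values or [])]
    # Drop fragments that were empty or whitespace only
    return " ".join(part for part in parts if part)


# In a spider callback this might be used as (the XPath is hypothetical):
#   item["title"] = clean_field(response.xpath("//h1//text()").getall())
title = clean_field(["  Thermal\n analysis ", "", " of nozzles "])
missing = clean_field([])
```

Returning an empty string for a missing element keeps each field present on the item even when the page omits it.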

After extracting the values from the DOM tree, we create a new item object and use a pipeline to store it in MongoDB.

At first, we assumed that every paper would have a document ID, but we found that many papers do not. So we created our own ID counter. Since Scrapy processes many requests concurrently, we protect the counter with a lock.

```python
import threading

import pymongo

# Fallback IDs for papers without a document ID, guarded by a lock
counter = 30000000000
counter_lock = threading.Lock()


class MongoPipeline(object):

    collection_name = 'paper'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from the project's settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'paper_crawler_all_new')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        global counter
        if dict(item).get('document_id') == "":
            # Assign the next fallback ID while holding the lock
            with counter_lock:
                item['document_id'] = str(counter)
                counter += 1
        # insert_one replaces the deprecated Collection.insert
        self.db[self.collection_name].insert_one(dict(item))
        return item
```
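For the pipeline to run, it has to be registered in the project's `settings.py`, together with the two settings that `from_crawler` reads. The fragment below is a sketch: the module path `paper_crawler.pipelines` and the local MongoDB URI are assumptions, while the database name is the default used in the code above.

```python
# settings.py (sketch): module path and URI are assumptions
ITEM_PIPELINES = {
    "paper_crawler.pipelines.MongoPipeline": 300,  # lower numbers run earlier
}
MONGO_URI = "mongodb://localhost:27017"    # assumed local MongoDB instance
MONGO_DATABASE = "paper_crawler_all_new"   # default from MongoPipeline.from_crawler
```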

# Topic Mining based on Labeled-LDA

After extracting the plain text from each paper's PDF, we have the raw data. In the next step we use a topic modeling tool to find the topic keywords of each paper. The tool we use is the [Stanford Topic Modeling Toolbox](http://nlp.stanford.edu/software/tmt/tmt-0.4/), which provides functions to train topic models based on Labeled-LDA and to create summaries of the text. To understand how it works, we need to look at the principles of LDA and Labeled-LDA.


## LDA
LDA stands for Latent Dirichlet Allocation. It is an example of a topic model and was first presented as a graphical model for topic discovery by David Blei, Andrew Ng, and Michael I. Jordan in 2003.

As described on [Wikipedia](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation):


In LDA, each document may be viewed as a mixture of various topics. It attempts to guess the topic of a document based on the words that are contained within that document and based on other documents within the same corpus that have similar words. LDA can be thought of as a type of clustering algorithm, where each observation is classified as one of a number of classifications. 

LDA is a completely unsupervised algorithm that models each document as a mixture of topics. The model generates automatic summaries of topics in terms of a discrete probability distribution over words for each topic, and further infers per-document discrete distributions over topics. Most importantly, LDA makes the explicit assumption that each word is generated from one underlying topic.
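The generative story behind these assumptions can be sketched in a few lines of Python. This is a toy illustration using only the standard library, not the toolbox's implementation: the two topics, their word probabilities, and the prior `alpha` are made up for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, length, rng):
    """Generate (topic, word) pairs for one document.

    topic_word: one dict per topic mapping word -> probability.
    alpha: Dirichlet prior over topics, one entry per topic.
    """
    theta = sample_dirichlet(alpha, rng)  # per-document mixture over topics
    topic_ids = list(range(len(topic_word)))
    pairs = []
    for _ in range(length):
        # Each word is generated from exactly one topic
        z = rng.choices(topic_ids, weights=theta)[0]
        vocab = list(topic_word[z])
        weights = [topic_word[z][w] for w in vocab]
        word = rng.choices(vocab, weights=weights)[0]
        pairs.append((z, word))
    return theta, pairs


# Toy topics, invented for this example
TOPICS = [
    {"orbit": 0.5, "thrust": 0.3, "fuel": 0.2},
    {"alloy": 0.6, "stress": 0.4},
]
theta, pairs = generate_document(TOPICS, alpha=[0.5, 0.5], length=8, rng=random.Random(0))
```

Training runs this story in reverse: given only the words, it infers the per-topic word distributions and each document's `theta`.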

However, LDA is not appropriate for multi-labeled corpora because, as an unsupervised model, it offers no obvious way of incorporating a supervised label set into its learning procedure. In particular, LDA often learns some topics that are hard to interpret, and the model provides no tools for tuning the generated topics to suit an end-use application, even when time and resources exist to provide some document labels. So we use Labeled-LDA, a supervised variant in which document labels guide the trained model.

## Labeled-LDA 

Labeled LDA (L-LDA) is a generative model for multiply labeled corpora that marries the multi-label supervision common to modern text datasets with the word-assignment ambiguity resolution of the LDA family of models. In contrast to standard LDA and its existing supervised variants, L-LDA associates each label with one topic in direct correspondence.

Like Latent Dirichlet Allocation, Labeled LDA models each document as a mixture of underlying topics and generates each word from one topic. Unlike LDA, L-LDA incorporates supervision by simply constraining the topic model to use only those topics that correspond to a document’s (observed) label set.
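The constraint can be illustrated with a short sketch: the document's topic mixture is drawn only over the topics named by its labels, and every other topic gets probability zero. The label names, the symmetric prior `alpha`, and the function name are assumptions made for this example.

```python
import random


def labeled_topic_mixture(doc_labels, all_labels, alpha, rng):
    """Draw a topic mixture that is non-zero only on the document's own labels."""
    allowed = [label for label in all_labels if label in doc_labels]
    if not allowed:
        raise ValueError("the document needs at least one known label")
    # Dirichlet draw restricted to the allowed topics (normalized gamma draws)
    draws = {label: rng.gammavariate(alpha, 1.0) for label in allowed}
    total = sum(draws.values())
    # Topics outside the label set keep probability 0.0
    return {label: draws.get(label, 0.0) / total for label in all_labels}


# Hypothetical label set: one topic per label
ALL_LABELS = ["propulsion", "materials", "avionics"]
theta = labeled_topic_mixture({"propulsion", "avionics"}, ALL_LABELS, alpha=0.5, rng=random.Random(1))
```

Words of this document can then only be assigned to the "propulsion" or "avionics" topics, which is what ties each learned topic to its label.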

To implement L-LDA, we use the Stanford TMT toolbox to train a topic model on our dataset. We store each paper's raw text in CSV format, use this file as the input dataset, and then run the TMT toolbox we downloaded.
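The config script in the next section reads the document ID from column 1, whitespace-separated labels from column 2, and the text from column 3. A minimal sketch of writing such a file is shown here. The dictionary keys, the sample record, and the choice to replace spaces inside a label with underscores are assumptions for illustration.

```python
import csv
import io


def write_tmt_csv(papers, handle):
    """Write one row per paper: ID, whitespace-separated labels, raw text."""
    writer = csv.writer(handle)
    for paper in papers:
        # The script splits labels on whitespace, so a label must not contain spaces
        labels = " ".join(label.replace(" ", "_") for label in paper["labels"])
        # Collapse newlines so the text of a paper stays compact on one row
        text = " ".join(paper["text"].split())
        writer.writerow([paper["id"], labels, text])


# Hypothetical record, written to an in-memory buffer instead of a file
buffer = io.StringIO()
write_tmt_csv(
    [{"id": "30000000000",
      "labels": ["Aircraft Design", "Propulsion"],
      "text": "first line\nsecond line"}],
    buffer,
)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```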

## TMT Toolbox
The Stanford Topic Modeling Toolbox (TMT) brings topic modeling tools to social scientists and others who wish to perform analysis on datasets that have a substantial textual component. It offers an easy-to-use function for training Labeled-LDA models. To use it, we need to specify a LabeledLDA dataset and tell the toolbox where the text and the labels come from.

Below is our config script:

```scala
// tells Scala where to find the TMT classes (as in the toolbox's example scripts)
import scalanlp.io._;
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import scalanlp.pipes.Pipes.global._;

import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;
import edu.stanford.nlp.tmt.model.llda._;

import scala.io.Source

// input CSV ("%s" is a placeholder for its path); column 1 holds the document ID
val source = CSVFile("%s") ~> IDColumn(1);

// load our own stop-word list, one word per line
val listOfLines = Source.fromFile("stopwords.txt").getLines.toList
val ll = listOfLines.map( x => x.stripLineEnd )

val tokenizer = {
  SimpleEnglishTokenizer() ~>            // tokenize on space and punctuation
  CaseFolder() ~>                        // lowercase everything
  WordsAndNumbersOnlyFilter() ~>         // ignore non-words and non-numbers
  MinimumLengthFilter(3)                 // take terms with >=3 characters
}

val text = {
  source ~>                              // read from the source file
  Column(3) ~>                           // select column containing text
  TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
  TermCounter() ~>                       // collect counts (needed below)
  TermMinimumDocumentCountFilter(10) ~>  // filter terms in <10 docs
  TermStopListFilter(ll) ~>              // filter out our stop words
  TermDynamicStopListFilter(30) ~>       // filter out 30 most common terms
  DocumentMinimumLengthFilter(10)        // take only docs with >=10 terms
}

// define fields from the dataset we are going to slice against
val labels = {
  source ~>                              // read from the source file
  Column(2) ~>                           // take column two, the labels
  TokenizeWith(WhitespaceTokenizer()) ~> // turns label field into an array
  TermCounter() ~>                       // collect label counts
  TermMinimumDocumentCountFilter(10)     // filter labels in < 10 docs
}

val dataset = LabeledLDADataset(text, labels);

// define the model parameters
val modelParams = LabeledLDAModelParams(dataset);

// Name of the output model folder to generate
val modelPath = file("%s");

// Trains the model, writing to the given output path
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
// or could use TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);
```

After writing the script, we run L-LDA with the bash command `jre1.7.0_80/bin/java -jar tmt-0.4.0.jar script`.
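Since the config script contains two `%s` placeholders, it can be filled in and launched from Python. The sketch below assumes the placeholders are filled in order (input CSV first, output model folder second) and that the script contains no other `%` characters. The function names and the file names in the example are hypothetical.

```python
import subprocess


def render_script(template, csv_path, model_dir):
    """Fill the two "%s" placeholders: input CSV first, output model folder second."""
    return template % (csv_path, model_dir)


def tmt_command(script_path, java="jre1.7.0_80/bin/java", jar="tmt-0.4.0.jar"):
    """Build the command line shown above as an argument list."""
    return [java, "-jar", jar, script_path]


def run_tmt(script_path):
    """Run the toolbox on a rendered script (requires Java and the TMT jar)."""
    # check=True raises CalledProcessError if the toolbox exits with an error
    subprocess.run(tmt_command(script_path), check=True)


# Shortened stand-in for the config script above
template = 'val source = CSVFile("%s") ~> IDColumn(1);\nval modelPath = file("%s");'
rendered = render_script(template, "papers.csv", "llda-model")
command = tmt_command("llda.scala")
```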