# LLDA Topic Modeling and Data Visualization

This project is a practical scenario in which we apply what we learned in this course.
We first collect data with a crawler, then use Labeled LDA (LLDA) to extract the topics of the collected documents, and finally use data visualization tools to display the results.

## 0 Using Scrapy to Get Data from NTRS

Scrapy is a mature web crawling framework. We have finished the crawler and have successfully collected almost 6000 papers from NTRS (NASA Technical Reports Server). Within this framework, we can easily handle requests and add a delay between requests so that we do not put too much load on the server.

First create a new scrapy project:
```bash
scrapy startproject paper_crawler
```
We will get the following project:

```bash
.
├── Resources
├── items.json
├── paper_crawler
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── items.pyc
│   ├── pipelines.py
│   ├── pipelines.pyc
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── __init__.py
│       ├── __init__.pyc
│       ├── paper_spider.py
│       └── paper_spider.pyc
├── scrapy.cfg
└── urls
```
We need to configure the following parts to make our crawler work.
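
The request delay mentioned earlier and the MongoDB pipeline from section 0.3 are both switched on in `settings.py`. The excerpt below is only a sketch: `DOWNLOAD_DELAY`, `AUTOTHROTTLE_ENABLED` and `ITEM_PIPELINES` are standard Scrapy settings, `MONGO_URI` and `MONGO_DATABASE` are the keys our pipeline reads, and every value shown (the one-second delay, the priority 300, the local MongoDB URI) is an assumption rather than the configuration we actually ran with.

```python
# settings.py (excerpt): a sketch with assumed values, to be tuned per crawl

BOT_NAME = "paper_crawler"

# Wait between consecutive requests to the same site (seconds; assumed value)
DOWNLOAD_DELAY = 1.0

# Let Scrapy adapt the delay to the server's response times
AUTOTHROTTLE_ENABLED = True

# Register the MongoDB pipeline from pipelines.py; 300 is an arbitrary priority
ITEM_PIPELINES = {
    "paper_crawler.pipelines.MongoPipeline": 300,
}

# Read by MongoPipeline.from_crawler; the URI assumes a local MongoDB instance
MONGO_URI = "mongodb://localhost:27017"
MONGO_DATABASE = "paper_crawler_all_new"
```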


### 0.0 Define Items

To crawl papers from NASA with Scrapy, we first define the item, which is the data structure holding the fields we want to extract from each page:

In [2]:
import scrapy
class PaperCrawlerItem(scrapy.Item):
    # One field per metadata entry shown on an NTRS record page
    title = scrapy.Field()
    ntrs_full_text = scrapy.Field()
    author_and_affiliation = scrapy.Field()
    abstract = scrapy.Field()
    publication_date = scrapy.Field()
    document_id = scrapy.Field()
    subject_category = scrapy.Field()
    patent_number = scrapy.Field()
    document_type = scrapy.Field()
    meeting_information = scrapy.Field()
    meeting_sponsor = scrapy.Field()
    financial_sponsor = scrapy.Field()
    organization_source = scrapy.Field()
    description = scrapy.Field()
    NASA_terms = scrapy.Field()
    other_descriptors = scrapy.Field()

### 0.1 Define entries for the URLs in paper_spider.py
The crawler needs a URL pattern so that it can request the result pages one after another. We start from the search URL "http://ntrs.nasa.gov/search.jsp?Ntx=mode%20matchall&Ntk=All&N=0&No=" and vary the `No` offset at the end of the query string, which the code advances in steps of 10 to move from one result page to the next.
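
The pattern can also be written as a small standalone helper, which makes it easy to check the generated URLs before handing them to the spider. This is only a sketch: `build_search_url` and `PAGE_SIZE` are hypothetical names, and the page size of 10 is taken from the `i * 10` step used in our spider.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://ntrs.nasa.gov/search.jsp"
PAGE_SIZE = 10  # assumed results per page, matching the i * 10 step in the spider


def build_search_url(keyword, page):
    """Return the title-search URL for the given zero-based result page."""
    params = {
        "Ntx": "mode matchall",
        "Ntk": "Title",
        "N": "0",
        "Ntt": keyword,
        "No": str(page * PAGE_SIZE),
    }
    # quote_via=quote encodes the space as %20 rather than '+'
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


urls = [build_search_url("hurricane", page) for page in range(3)]
for url in urls:
    print(url)
```

Building the query from a dictionary keeps the parameters readable and avoids hand-escaping the keyword.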

In [3]:
def start_requests(self):
        for i in range(102231):
            yield self.make_requests_from_url(
                "https://ntrs.nasa.gov/search.jsp?Ntx=mode%20matchall&Ntk=Title&N=0&Ntt=hurricane&No=" + str(i * 10))

### 0.2 Define the function to parse the content of the page and retrieve information
We carefully analyzed the HTML DOM tree to locate the values we need and used XPath to extract them. This is a tedious task, because we have to handle every case and the elements are not always in the same format.

In [4]:
from scrapy.selector import Selector  # used below to re-parse each author row

def parse_content(self, response):
        title_list = response.xpath('//td[@id="recordtitle"]//text()').extract()
        title_string = ""
        item = PaperCrawlerItem()
        for word in title_list:
            title_string += word
        # print(str(title_string))
        item['title'] = title_string

        pdf_url = response.xpath(
            '//table[@id="doctable"]/tr[contains(td/text(),"NTRS Full-Text:")]/td/a[contains(text(),"Click to View")]/@href').extract()
        item['ntrs_full_text'] = pdf_url

        user_affilliation = []
        u_a = response.xpath(
            '//table[@id="doctable"]/tr[contains(td/text(),"Author and Affiliation:")]/td/table/tr').extract()
        if u_a != []:
            for entry in u_a:
                author = Selector(text=entry).xpath('//tr/td/text()').extract()
                affilliation = Selector(text=entry).xpath('//tr/td/span/text()').extract()
                temp = (author[0], affilliation[0].replace("(", "").replace(")", ""))
                user_affilliation.append(temp)
        else:
            u_a = response.xpath(
                '//table[@id="doctable"]/tr[contains(td/text(),"Author and Affiliation:")]/td/text()').extract()[1:]
            # print(u_a)
            for entry in u_a:
                author = entry.replace("\r", "").replace("\n", "").replace("\t", "")
                if author != "":
                    temp = (author, "")
                    user_affilliation.append(temp)
        # print(user_affilliation)
        item['author_and_affiliation'] = user_affilliation

        abstract_list = response.xpath(
            '//table[@id="doctable"]/tr[contains(td/text(),"Abstract:")]/td/descendant-or-self::*/text()').extract()
        abstract = ""
        for words in abstract_list:
            words = words.replace("\r", "").replace("\n", "").replace("\t", "")
            if words != "Abstract:" and words != "":
                abstract += words
        item["abstract"] = abstract

        publication_date_raw = response.xpath(
            '//table[@id="doctable"]/tr[contains(td/text(),"Publication Date:")]//text()').extract()
        publication_date = ""
        for words in publication_date_raw:
            words = words.replace("\r", "").replace("\n", "").replace("\t", "")
            if words != "Publication Date:" and words != "":
                publication_date = words
        item["publication_date"] = publication_date

        document_id_raw = response.xpath(
            '//table[@id="doctable"]/tr[contains(td/text(),"Document ID:")]/td/div[@id="docidDiv"]/text()').extract()
        document_id = ""
        for words in document_id_raw:
            words = words.replace("\r", "").replace("\n", "").replace("\t", "")
            if words != "":
                document_id = words
        item["document_id"] = document_id

        subject_category_raw = response.xpath(
            '//table[@id="doctable"]/tr[contains(td/text(),"Subject Category:")]//text()').extract()
        subject_category = ""
        for words in subject_category_raw:
            words = words.replace("\r", "").replace("\n", "").replace("\t", "")
            if words != "Subject Category:" and words != "":
                subject_category += words
        item["subject_category"] = subject_category

        patent_number_raw = response.xpath(
            '//table[@id="doctable"]/tr[contains(td/text(),"Report/Patent Number:")]//text()').extract()
        patent_number = ""
        for words in patent_number_raw:
            words = words.replace("\r", "").replace("\n", "").replace("\t", "")
            if words != "Report/Patent Number:" and words != "":
                patent_number += words
        item["patent_number"] = patent_number

        document_type_raw = response.xpath(
            '//table[@id="doctable"]/tr[contains(td/text(),"Document Type:")]//text()').extract()
        document_type = ""
        for words in document_type_raw:
            words = words.replace("\r", "").replace("\n", "").replace("\t", "")
            if words != "Document Type:" and words != "":
                document_type += words
        item["document_type"] = document_type

        meeting_information_raw = response.xpath(
            '//table[@id="doctable"]/tr[contains(td/text(),"Meeting Information:")]//text()').extract()
        meeting_information = ""
        for words in meeting_information_raw:
            words = words.replace("\r", "").replace("\n", "").replace("\t", "")
            if words != "Meeting Information:" and words != "":
                meeting_information += words
        item["meeting_information"] = meeting_information

        meeting_sponsor_raw = response.xpath(
            '//table[@id="doctable"]/tr[contains(td/text(),"Meeting Sponsor:")]//text()').extract()
        meeting_sponsor = ""
        for words in meeting_sponsor_raw:
            words = words.replace("\r", "").replace("\n", "").replace("\t", "")
            if words != "Meeting Sponsor:" and words != "":
                meeting_sponsor += words
        item["meeting_sponsor"] = meeting_sponsor

        financial_sponsor_raw = response.xpath(
            '//table[@id="doctable"]/tr[contains(td/text(),"Financial Sponsor:")]//text()').extract()
        financial_sponsor = ""
        for words in financial_sponsor_raw:
            words = words.replace("\r", "").replace("\n", "").replace("\t", "")
            if words != "Financial Sponsor:" and words != "":
                financial_sponsor += words
        item["financial_sponsor"] = financial_sponsor

        organization_source_raw = response.xpath(
            '//table[@id="doctable"]/tr[contains(td/text(),"Organization Source:")]//text()').extract()
        organization_source = ""
        for words in organization_source_raw:
            words = words.replace("\r", "").replace("\n", "").replace("\t", "")
            if words != "Organization Source:" and words != "":
                organization_source += words
        item["organization_source"] = organization_source

        description_raw = response.xpath(
            '//table[@id="doctable"]/tr[contains(td/text(),"Description:")]//text()').extract()
        description = ""
        for words in description_raw:
            words = words.replace("\r", "").replace("\n", "").replace("\t", "")
            if words != "Description:" and words != "":
                description += words
        item["description"] = description

        NASA_terms_raw = response.xpath(
            '//table[@id="doctable"]/tr[contains(td/text(),"NASA Terms:")]//text()').extract()
        NASA_terms = ""
        for words in NASA_terms_raw:
            words = words.replace("\r", "").replace("\n", "").replace("\t", "")
            if words != "NASA Terms:" and words != "":
                NASA_terms += words
        item["NASA_terms"] = NASA_terms

        other_descriptors_raw = response.xpath(
            '//table[@id="doctable"]/tr[contains(td/text(),"Other Descriptors:")]//text()').extract()
        other_descriptors = ""
        for words in other_descriptors_raw:
            words = words.replace("\r", "").replace("\n", "").replace("\t", "")
            if words != "Other Descriptors:" and words != "":
                other_descriptors += words
        item["other_descriptors"] = other_descriptors

        yield item
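
Most of the fields above follow the same three steps: take the text fragments of a table row, strip the whitespace debris, and drop the row label. A minimal sketch of that shared step is shown below, assuming the fragments have already been extracted with XPath; `clean_field` is a hypothetical helper, not part of our spider, and the sample fragments are made up.

```python
def clean_field(fragments, label):
    """Join XPath text fragments, dropping whitespace debris and the row label."""
    parts = []
    for fragment in fragments:
        text = fragment.replace("\r", "").replace("\n", "").replace("\t", "")
        if text != "" and text != label:
            parts.append(text)
    return "".join(parts)


# Hypothetical fragments, shaped like the output of .extract() on one table row
raw = ["\r\n\t", "Document Type:", "\n", "Conference Paper", "\t"]
document_type = clean_field(raw, "Document Type:")
print(document_type)
```

Note that this matches the fields that concatenate their fragments; `publication_date` and `document_id` keep only the last non-empty fragment, so they would need a slightly different helper.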

### 0.3 Define Pipeline to store items to MongoDB
After extracting the values from the DOM tree into an item, we use a pipeline to store the item in MongoDB.
At first we assumed that every paper would have a document ID, but we found that many papers have none, so we created our own ID counter. Scrapy processes many requests concurrently (it is built on the asynchronous Twisted framework), so we guard the counter with a lock to be safe.


In [None]:
import pymongo
import threading

class MongoPipeline(object):

    collection_name = 'paper'
    global counter
    counter = 30000000000

    global counter_lock
    counter_lock = threading.Lock()

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'paper_crawler_all_new')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if dict(item).get('document_id') == "":
            global counter_lock
            counter_lock.acquire()
            try:
                global counter
                item['document_id'] = str(counter)
                counter += 1
            finally:
                counter_lock.release()
        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        self.db[self.collection_name].insert_one(dict(item))
        return item
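
The fallback ID logic can also be isolated from the pipeline, which makes it easier to test on its own. The sketch below is one possible variation, not the code we ran: `IdCounter` is a hypothetical class, and it uses a `with` block instead of explicit `acquire`/`release` calls.

```python
import threading


class IdCounter:
    """Hand out consecutive fallback IDs; safe to call from several threads."""

    def __init__(self, start):
        self._next = start
        self._lock = threading.Lock()

    def next_id(self):
        # The with-statement releases the lock even if an exception is raised
        with self._lock:
            value = self._next
            self._next += 1
        return str(value)


counter = IdCounter(30000000000)  # same starting value as the pipeline above
first = counter.next_id()
second = counter.next_id()
print(first, second)
```

Keeping the counter in an object also avoids the module-level `global` declarations.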

### 0.4 Result of Data Collecting
Once the crawl finished, we had nearly 6000 documents matching the keyword hurricane, with the metadata of each document stored in MongoDB. This is not enough on its own, because many of the documents are not papers or have no URL to a PDF, so we need to process the data further.

Here is an example of the metadata stored for one paper:

```json
/* 1 */
{
    "_id" : ObjectId("57c6232ddaa39c2add17f1f8"),
    "document_type" : "Conference Paper",
    "financial_sponsor" : "NASA Marshall Space Flight Center; Huntsville, AL United States",
    "organization_source" : "NASA Marshall Space Flight Center; Huntsville, AL United States",
    "description" : "3p; In English",
    "title" : "Validation of Rain Rate Retrievals for the Airborne Hurricane Imaging Radiometer (HIRAD)",
    "ntrs_full_text" : [ 
        "http://hdl.handle.net/2060/20150022936"
    ],
    "abstract" : "The NASA Hurricane and Severe Storm Sentinel (HS3) mission is an aircraft field measurements program using NASA's unmanned Global Hawk aircraft system for remote sensing and in situ observations of Atlantic and Caribbean Sea hurricanes. One of the principal microwave instruments is the Hurricane Imaging Radiometer (HIRAD), which measures surface wind speeds and rain rates. For validation of the HIRAD wind speed measurement in hurricanes, there exists a comprehensive set of comparisons with the Stepped Frequency Microwave Radiometer (SFMR) with in situ GPS dropwindsondes [1]. However, for rain rate measurements, there are only indirect correlations with rain imagery from other HS3 remote sensors (e.g., the dual-frequency Ka- & Ku-band doppler radar, HIWRAP), which is only qualitative in nature. However, this paper presents results from an unplanned rain rate measurement validation opportunity that occurred in 2013, when HIRAD flew over an intense tropical squall line that was simultaneously observed by the Tampa NEXRAD meteorological radar (Fig. 1). During this experiment, Global Hawk flying at an altitude of 18 km made 3 passes over the rapidly propagating thunderstorm, while the TAMPA NEXRAD perform volume scans on a 5-minute interval. Using the well-documented NEXRAD Z-R relationship, 2D images of rain rate (mm/hr) were obtained at two altitudes (3 km & 6 km), which serve as surface truth for the HIRAD rain rate retrievals. A preliminary comparison of HIRAD rain rate retrievals (image) for the first pass and the corresponding closest NEXRAD rain image is presented in Fig. 2 & 3. This paper describes the HIRAD instrument, which 1D synthetic-aperture thinned array radiometer (STAR) developed by NASA Marshall Space Flight Center [2]. The rain rate retrieval algorithm, developed by Amarin et al. 
[3], is based on the maximum likelihood estimation (MLE) technique, which compares the observed Tb's at the HIRAD operating frequencies of 4, 5, 6 and 6.6 GHz with corresponding theoretical Tb values from a forward radiative transfer model (RTM). The optimum solution is the integrated rain rate that minimizes the difference between RTM and observed values. Because the excess Tb from rain comes from the direct upwelling and the indirect reflected downwelling paths through the atmosphere, there are several assumptions made for the 2D rain distribution in the antenna incident plane (crosstrack to flight direction). The opportunity to knowing 2D rain surface truth from NEXRAD at two different altitudes will enable a comprehensive evaluation to be preformed and reported in this paper.",
    "subject_category" : "METEOROLOGY AND CLIMATOLOGY",
    "author_and_affiliation" : [ 
        [ 
            "Jacob, Maria Marta", 
            "National Commission of Space Activities, Buenos Aires, Argentina"
        ], 
        [ 
            "Salemirad, Matin", 
            "University of Central Florida, Orlando, FL, United States"
        ], 
        [ 
            "Jones, W. Linwood", 
            "University of Central Florida, Orlando, FL, United States"
        ], 
        [ 
            "Biswas, Sayak", 
            "NASA Marshall Space Flight Center, Huntsville, AL United States"
        ], 
        [ 
            "Cecil, Daniel", 
            "NASA Marshall Space Flight Center, Huntsville, AL United States"
        ]
    ],
    "other_descriptors" : "HURRICANE;  RAIN RETRIEVAL",
    "meeting_information" : "IGARSS 2015; 26-31 Jul. 2015; Milan; Italy",
    "publication_date" : "Jul 26, 2015",
    "meeting_sponsor" : "Institute of Electrical and Electronics Engineers; Geoscience and Remote Sensing Society; New York, NY, United States",
    "patent_number" : "MSFC-E-DAA-TN20510",
    "document_id" : "20150022936",
    "NASA_terms" : "IMAGING TECHNIQUES; METEOROLOGICAL RADAR; HURRICANES; MICROWAVE RADIOMETERS; THUNDERSTORMS; RAIN; GROUND TRUTH; RADIATIVE TRANSFER; REMOTE SENSING; TRACKING (POSITION)"
}

```

We can use document_type to filter out the documents that are not papers, and ntrs_full_text to fetch the PDF of each paper. The publication_date is also very important: we will use it to analyze the topic trend over the years.

## 1 Using __tika__ to get text content of PDF
After storing the metadata of the crawled papers, we read the PDF links from MongoDB, download each PDF to a temporary file, and pass that file to the open source tool Tika, which converts the PDF to plain text.

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). A Python port of the library is available at https://github.com/chrismattmann/tika-python


In [1]:
import tika
from tika import parser
import requests

def convertPdfToTextByPathAndFilelocation(url):
    webPdf = requests.get(url)
    # Store as a temporary file; PDF data is binary, so open in 'wb' mode
    path = "tmp"
    with open(path, 'wb') as tmpFile:
        tmpFile.write(webPdf.content)

    return pdf2Text(path)

tika.TikaClientOnly = True

def pdf2Text(filePath):
    return parser.from_file(filePath)['content']



In [2]:
# Example url
url = "https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20150000264.pdf"
print(convertPdfToTextByPathAndFilelocation(url))
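
The function above always writes to the same file name, which is fine for a sequential run. If several downloads might overlap, a uniquely named temporary file is safer. The following is a sketch under that assumption; `save_to_temp_file` is a hypothetical helper and the bytes are a placeholder rather than a real PDF.

```python
import os
import tempfile


def save_to_temp_file(content, suffix=".pdf"):
    """Write raw bytes to a uniquely named temporary file and return its path."""
    # delete=False keeps the file on disk after the handle is closed,
    # so another tool can open it by path; the caller removes it afterwards
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as handle:
        handle.write(content)
        return handle.name


# Placeholder bytes standing in for a downloaded PDF body
fake_pdf = b"%PDF-1.4 placeholder"
path = save_to_temp_file(fake_pdf)
with open(path, "rb") as handle:
    round_trip = handle.read()
os.remove(path)
print(round_trip == fake_pdf)
```

The returned path could then be passed to `pdf2Text` in place of the fixed name.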


## 2 Using The Stanford Topic Modeling Toolbox (TMT) to extract Topic of each Paper
LDA stands for Latent Dirichlet Allocation, a topic model for document collections. It is a three-layer Bayesian model with word, topic, and document layers. In the model, each word of a document is generated by first choosing a topic with a certain probability and then choosing a word from that topic with a certain probability. Each document therefore has a multinomial distribution over topics, and each topic has a multinomial distribution over words.

The generative process of LDA can be simplified to the following steps:
1. for each word position in a document, choose a topic from the document's topic distribution
2. choose a word from that topic's word distribution
3. repeat until every word in the document has been generated
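
The three steps can be sketched in a few lines of Python. The topics, words, and probabilities below are invented purely for illustration, and the Dirichlet priors of the full model are left out: the two distributions are simply fixed.

```python
import random

# Hypothetical toy distributions, chosen only for illustration
topic_given_doc = {"storms": 0.7, "radar": 0.3}
word_given_topic = {
    "storms": {"hurricane": 0.6, "wind": 0.4},
    "radar": {"microwave": 0.5, "antenna": 0.5},
}


def sample(distribution, rng):
    """Draw one key from a {outcome: probability} mapping."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate_document(length, rng):
    words = []
    for _ in range(length):                      # step 3: repeat per word
        topic = sample(topic_given_doc, rng)     # step 1: pick a topic
        word = sample(word_given_topic[topic], rng)  # step 2: pick a word
        words.append(word)
    return words


rng = random.Random(0)  # fixed seed so the run is repeatable
document = generate_document(8, rng)
print(document)
```

Topic modeling runs this process in reverse: given only the words, it estimates the two distributions.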

We define the document set as D and the topic set as T.
Each document in D is treated as a vector of word counts, which we call a bag of words.
We also collect all the words that occur in D into a vocabulary.
For a document d in D, θd is its topic distribution, whose i-th entry is the probability of the i-th topic in d. For a topic t in T, φt is its word distribution, whose i-th entry is the probability of the i-th vocabulary word in t.

The LDA equation is:

$P(w|d) = \sum_{t \in T} P(w|t) \, P(t|d)$

Using the topic as a middle layer, the current θd and φt give the probability of word w in document d. The estimates of θd and φt are updated over many iterations until the result converges.
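
As a small numeric sketch of this computation, the snippet below combines a topic distribution for one document with the word distributions of the topics, summing the contribution of every topic. All values are hypothetical toy numbers, and `word_probability` is an illustrative name.

```python
# Hypothetical toy values: theta_d is P(t|d), phi[t] is P(w|t)
theta_d = {"storms": 0.7, "radar": 0.3}
phi = {
    "storms": {"hurricane": 0.6, "wind": 0.4},
    "radar": {"hurricane": 0.1, "microwave": 0.9},
}


def word_probability(word, theta_d, phi):
    """P(w|d): add up P(w|t) * P(t|d) over all topics t."""
    return sum(phi[t].get(word, 0.0) * p_t for t, p_t in theta_d.items())


p_hurricane = word_probability("hurricane", theta_d, phi)
print(round(p_hurricane, 2))
```

With these numbers the result is 0.6 · 0.7 + 0.1 · 0.3 = 0.45.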

In this project, we will use The Stanford Topic Modeling Toolbox (TMT) to do the topic modeling on our data. And in order to make the topic more readable and more accurate, we will use the Labeled-LDA feature of TMT.

The input is a csv file with the format:

| document_id | pre_labels                                                                       | content |
|-------------|----------------------------------------------------------------------------------|---------|
| 20100032965 | SPECTRAL/ENGINEERING MICROWAVE OCEAN TEMPERATURE BRIGHTNESS TEMPERATURE CYCLONES | .....   |

We also need a Scala script to define the task and load the stopwords.

```scala
val source = CSVFile("%s") ~> IDColumn(1);

import scala.io.Source
val listOfLines = Source.fromFile("stopwords.txt").getLines.toList
val ll = listOfLines.map( x => x.stripLineEnd )

val tokenizer = {
  SimpleEnglishTokenizer() ~>            // tokenize on space and punctuation
  CaseFolder() ~>                        // lowercase everything
  WordsAndNumbersOnlyFilter() ~>         // ignore non-words and non-numbers
  MinimumLengthFilter(3)                 // take terms with >=3 characters
}

val text = {
  source ~>                              // read from the source file
  Column(3) ~>                           // select column containing text
  TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
  TermCounter() ~>                       // collect counts (needed below)
  TermMinimumDocumentCountFilter(10) ~>   // filter terms in <10 docs
  TermStopListFilter(ll) ~>
  TermDynamicStopListFilter(30) ~>       // filter out 30 most common terms
  DocumentMinimumLengthFilter(10)         // take only docs with >=10 terms
}

// define fields from the dataset we are going to slice against
val labels = {
  source ~>                              // read from the source file
  Column(2) ~>                           // take column two, the pre_labels
  TokenizeWith(WhitespaceTokenizer()) ~> // turns label field into an array
  TermCounter() ~>                       // collect label counts
  TermMinimumDocumentCountFilter(10)     // filter labels in < 10 docs
}

val dataset = LabeledLDADataset(text, labels);

// define the model parameters
val modelParams = LabeledLDAModelParams(dataset);

// Name of the output model folder to generate
val modelPath = file("%s");

// Trains the model, writing to the given output path
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
// or could use TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);
```

(We had to use Java 7 to run this jar. It took us a little time to figure out what was going wrong...)

After LLDA finishes, we get two useful files: document-topic-distributions.csv and label-index.txt.
The document-topic-distributions.csv file gives the topic probabilities of each document.

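| document_id | topic | probability | topic | probability | topic | probability | topic | probability | topic | probability |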
|-------------|----|-------------|----|-------------|----|-------------|----|-------------|----|-------------| 
| 20140017430 | 0  | 0.281779471 | 1  | 0.467417845 | 2  | 0.225072019 | 3  | 0.025692527 | 4  | 3.81E-05    | 
| 20140013335 | 5  | 0.07320749  | 6  | 0.059176645 | 7  | 0.867615866 |    |             |    |             | 
| 20150000261 | 3  | 2.66E-05    | 6  | 6.92E-06    | 8  | -9.69E-07   | 9  | -6.99E-07   | 10 | 0.999968175 | 
| 20120002859 | 0  | 3.94E-04    | 2  | 0.579140661 | 11 | 4.42E-05    | 12 | 0.420420941 |    |             | 
| 20100022203 | 0  | -2.10E-05   | 11 | -1.58E-05   | 12 | 1.000072506 | 13 | -3.57E-05   |    |             | 
| 20100032965 | 0  | 0.094512137 | 6  | 9.99E-05    | 7  | 0.081092946 | 11 | 3.67E-05    | 14 | 0.824258282 | 
| 20110006355 | 15 | 1           |    |             |    |             |    |             |    |             | 
| 20140010541 | 2  | 0.007491234 | 3  | 3.06E-04    | 6  | 0.02016012  | 8  | 0.00551627  | 10 | 0.966525931 | 

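In each row, the first column is the document ID. It is followed by pairs of values: an integer, which is the index of a topic, and a float, which is the probability of that topic in the document.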

The label-index.txt file lists the topic labels, one per line, and the line number (starting at 0) is the index of the topic. We use Python to parse label-index.txt into an index-to-label mapping and then look up the topic of each paper.


In [8]:
def parseIndex(filename):
    # Map each zero-based line number to the topic label on that line
    res = {}
    with open(filename, 'r') as f:
        for i, line in enumerate(f):
            res[i] = line.strip()
    return res

print(parseIndex("label-index.txt"))

{0: 'HURRICANES', 1: 'STORM/SURGE', 2: 'STORMS', 3: 'PRECIPITATION', 4: 'EROSION', 5: 'GRAVITY/WAVE', 6: 'CYCLONES', 7: 'TROPICAL/CYCLONES', 8: 'HUMIDITY', 9: 'WIND/SHEAR', 10: 'CONVECTION', 11: 'MICROWAVE', 12: 'SURFACE/WINDS', 13: 'ALTITUDE', 14: 'BRIGHTNESS/TEMPERATURE', 15: '', 16: 'LIDAR', 17: 'LIGHTNING', 18: 'RADAR', 19: 'VORTICITY', 20: 'TURBULENCE', 21: 'WATER/VAPOR', 22: 'CLOUDS', 23: 'SURFACE/PRESSURE', 24: 'CONDENSATION', 25: 'TORNADOES', 26: 'TROPOPAUSE', 27: 'NATURAL/HAZARDS', 28: 'EVAPORATION', 29: 'OSCILLATIONS', 30: 'TYPHOONS', 31: 'AEROSOLS', 32: 'SEA/SURFACE/TEMPERATURE', 33: 'WIND/PROFILES', 34: 'TIDES', 35: 'ATMOSPHERIC/TEMPERATURE', 36: 'HEAT/FLUX', 37: 'OCEAN/CIRCULATION'}


In [18]:
import csv
def getTopicOfPaper(dis_csv):
    index_dict = parseIndex("label-index.txt")
    paper_topic = {}
    # newline='' is the mode the csv module expects in Python 3
    with open(dis_csv, newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        for row in reader:
            best_topic = None
            best_prob = 0.0
            # Columns after the ID alternate: topic index, probability
            for i in range(1, len(row) - 1, 2):
                if row[i] == '' or row[i + 1] == '':
                    continue
                prob = float(row[i + 1])
                if prob > best_prob:
                    best_prob = prob
                    best_topic = int(row[i])
            # Skip documents that have no topic with a positive probability
            if best_topic is not None:
                paper_topic[row[0]] = index_dict[best_topic]
    return paper_topic

topic_dict = getTopicOfPaper("document-topic-distributions.csv")

## 3 Data Visualization

Now we want to find the trend of each topic over the years, so we compute some simple statistics on the data.

First we need the number of papers for each topic in each year.


In [17]:
from pymongo import MongoClient
def getPapersYear(paperId):
    client = MongoClient(host='localhost', port=27017)
    papers = client.paper_crawler_all_new.paper
    paper = papers.find_one({"document_id":str(paperId)})
    return paper['publication_date'].split(',')[1].strip()
print(getPapersYear(20160011133))

2015


In [None]:
import collections
def getYearTopics(topic_dict):
    # res[topic][year] -> number of papers with that topic in that year
    res = collections.defaultdict(dict)
    for k, v in topic_dict.items():
        year = getPapersYear(k)  # look up once; each call queries MongoDB
        if year not in res[v]:
            res[v][year] = 0
        res[v][year] += 1
    return res
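
Turning these nested counts into a table with one row per topic and one column per year could look like the sketch below. `write_trend_csv` is a hypothetical helper, the sample counts are made up, and two details are assumptions inferred from the final table rather than confirmed steps of our export: replacing `/` with `_` in labels and naming the empty label "Others".

```python
import csv
import io


def display_name(topic):
    # Assumption: '/' in labels becomes '_' and the empty label becomes "Others"
    return topic.replace("/", "_") if topic else "Others"


def write_trend_csv(year_topics, out):
    """Write one row per topic and one column per year; missing counts become 0."""
    years = sorted({year for counts in year_topics.values() for year in counts})
    named = {display_name(topic): counts for topic, counts in year_topics.items()}
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["CountryName", "CountryCode"] + years)
    for name in sorted(named):
        counts = named[name]
        writer.writerow([name, name] + [counts.get(year, 0) for year in years])


# Hypothetical counts shaped like the result of getYearTopics
sample_counts = {
    "WIND/SHEAR": {"2014": 2},
    "": {"2015": 1},
    "LIDAR": {"2014": 1, "2015": 3},
}
buffer = io.StringIO()
write_trend_csv(sample_counts, buffer)
print(buffer.getvalue())
```

Writing to a file instead only requires passing an open file object in place of the `StringIO` buffer.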


Then we write the data to a CSV file, which serves as the raw input for the frontend viewer we build to show the topic trend.

The data format is shown below (the CountryName and CountryCode columns both hold the topic label):

| CountryName             | CountryCode             | 1969 | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1977 | 1978 | 1979 | 1980 | 1981 | 1982 | 1983 | 1984 | 1985 | 1986 | 1987 | 1988 | 1989 | 1990 | 1991 | 1992 | 1993 | 1994 | 1996 | 1999 | 2001 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
|-------------------------|-------------------------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| AEROSOLS                | AEROSOLS                | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 1    | 0    | 0    | 0    |
| ALTITUDE                | ALTITUDE                | 0    | 1    | 0    | 0    | 0    | 1    | 2    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 5    | 0    | 0    | 0    | 0    | 1    | 1    | 0    | 0    | 0    |
| ATMOSPHERIC_TEMPERATURE | ATMOSPHERIC_TEMPERATURE | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    |
| BRIGHTNESS_TEMPERATURE  | BRIGHTNESS_TEMPERATURE  | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 2    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 1    | 0    | 2    | 1    | 5    | 2    | 0    | 1    | 0    |
| CLOUDS                  | CLOUDS                  | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 1    | 2    | 0    | 1    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 2    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    |
| CONDENSATION            | CONDENSATION            | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    |
| CONVECTION              | CONVECTION              | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 1    | 0    | 1    | 0    | 0    | 1    | 0    | 0    | 1    | 2    | 6    | 2    | 4    | 4    | 1    | 0    | 0    |
| EROSION                 | EROSION                 | 0    | 1    | 0    | 0    | 0    | 1    | 1    | 1    | 2    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 2    | 0    | 0    | 0    | 0    |
| EVAPORATION             | EVAPORATION             | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 1    | 0    |
| GRAVITY_WAVE            | GRAVITY_WAVE            | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 1    | 0    | 2    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    |
| HEAT_FLUX               | HEAT_FLUX               | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 1    | 0    | 0    | 1    | 2    | 0    | 0    | 0    | 1    | 0    | 2    | 0    | 0    | 0    |
| HUMIDITY                | HUMIDITY                | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 1    | 0    |
| HURRICANES              | HURRICANES              | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 2    | 0    | 0    | 0    | 0    | 0    | 0    | 3    | 1    | 1    | 2    | 0    | 1    | 0    | 1    | 0    | 0    |
| LIDAR                   | LIDAR                   | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 2    | 1    | 0    | 0    | 0    | 1    | 2    | 0    | 1    | 1    | 0    |
| LIGHTNING               | LIGHTNING               | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 1    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 1    | 0    | 1    | 0    | 0    | 0    | 0    | 1    | 1    | 0    | 2    | 0    | 3    | 0    | 0    | 1    | 0    | 0    |
| NATURAL_HAZARDS         | NATURAL_HAZARDS         | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    |
| OCEAN_CIRCULATION       | OCEAN_CIRCULATION       | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 2    | 0    | 0    | 0    |
| OSCILLATIONS            | OSCILLATIONS            | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    |
| Others                  | Others                  | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 1    | 0    | 2    | 0    | 2    | 0    | 0    | 1    | 0    |
| PRECIPITATION           | PRECIPITATION           | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 1    | 0    | 0    | 1    | 3    | 2    | 0    | 6    | 0    | 0    | 3    | 0    | 0    | 0    | 1    | 2    | 1    | 1    | 4    | 3    | 9    | 2    | 1    | 0    | 1    |
| RADAR                   | RADAR                   | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 2    | 0    | 1    | 0    | 1    | 0    | 0    | 0    |
| SEA_SURFACE_TEMPERATURE | SEA_SURFACE_TEMPERATURE | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 2    | 1    | 0    | 3    | 1    | 2    | 2    | 1    | 0    | 0    |
| STORM_SURGE             | STORM_SURGE             | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 1    | 5    | 1    | 0    | 1    | 0    | 2    | 0    | 1    | 0    | 0    |
| STORMS                  | STORMS                  | 0    | 0    | 0    | 1    | 2    | 0    | 4    | 2    | 2    | 0    | 0    | 0    | 1    | 1    | 2    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 2    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 1    | 2    | 0    | 1    | 0    | 0    |
| SURFACE_WINDS           | SURFACE_WINDS           | 0    | 1    | 0    | 0    | 1    | 0    | 3    | 0    | 0    | 2    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 2    | 0    | 1    | 0    |
| TIDES                   | TIDES                   | 0    | 0    | 2    | 1    | 0    | 0    | 2    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 1    | 0    | 0    | 0    |
| TORNADOES               | TORNADOES               | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 3    | 0    | 1    | 0    | 0    |
| TROPICAL_CYCLONES       | TROPICAL_CYCLONES       | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 1    | 0    | 1    | 0    | 0    | 1    | 1    | 4    | 3    | 1    | 0    |
| TROPOPAUSE              | TROPOPAUSE              | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 2    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 2    | 3    | 0    |
| TURBULENCE              | TURBULENCE              | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    |
| TYPHOONS                | TYPHOONS                | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    |
| VORTICITY               | VORTICITY               | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 2    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 1    | 1    | 0    | 3    | 1    | 1    | 0    | 0    |
| WATER_VAPOR             | WATER_VAPOR             | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    |
| WIND_PROFILES           | WIND_PROFILES           | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 2    | 1    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    |
| WIND_SHEAR              | WIND_SHEAR              | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 2    | 0    | 0    | 0    | 2    | 0    | 0    | 0    | 0    | 0    | 0    |

For data visualization, we present our data as charts rendered in an HTML page. To build them with little effort, we used the open-source D3.js toolkit (Data-Driven Documents, https://d3js.org/), which can easily render a page from raw datasets.

To run the website, we set it up using Node.js with the Express.js framework and deployed it on AWS EC2. It can now be accessed at: http://ec2-54-211-222-201.compute-1.amazonaws.com/

The website shows dynamic charts that visualize the data loaded from the dataset in CSV format. Feel free to play with it!
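Since the charts read their data from CSV, a count table like the one above has to be exported to that format at some point. The following is only a minimal sketch of one way to do that in Python, assuming the table is available as pipe-delimited Markdown text. The function name `markdown_table_to_csv` and the sample column names are hypothetical and the sample values are illustrative; this is not necessarily how our dataset was actually produced.

```python
import csv
import io


def markdown_table_to_csv(table_text):
    """Convert a pipe-delimited Markdown table into CSV text (hypothetical helper)."""
    rows = []
    for line in table_text.strip().splitlines():
        line = line.strip()
        # Ignore anything that is not a table row.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the Markdown separator row, e.g. |---|---|.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Illustrative input only: the column names and counts are placeholders.
sample = """
| keyword | col_a | col_b |
|---------|-------|-------|
| RADAR   | 0     | 1     |
| LIDAR   | 2     | 0     |
"""
csv_text = markdown_table_to_csv(sample)
```

The resulting text can be written to a `.csv` file and served as a static asset for the chart page to load.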


## 4 Lessons Learned

From this team project, we picked up many techniques and a lot of knowledge related to data science. Using data analytics tools to analyze the data we crawled and then visualize it was an exciting process for us.

At the very beginning, we had little experience with gathering data, writing crawlers, and so on. In this project we practiced heavily and improved our crawler from a single-threaded process to an efficient multi-threaded one, so we grew considerably in data retrieval. We learned how much data quality matters for the processing that follows, and that crawl efficiency directly affects the project schedule. We kept practicing and finally got a satisfying result.

For the data processing stage, we used the Stanford LLDA toolkit to turn the raw crawled PDF content into the results we wanted, and finally visualized them with good-looking web demo pages.