kenns29/topic-analysis

Welcome to Our Team

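This tutorial briefly describes how to get started on the project. It covers getting the project, the tools to install, the development environment, the project structure, the database, and other important libraries and utilities.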

Get the project

In order to access and update the project, you need git. If you are not familiar with git, please check https://git-scm.com/.

Our git server is located at http://vaderserver0.cidse.dhcp.asu.edu:10000/. Feel free to sign up with a user name and password, but please ask me to confirm the account. Search for topic-analysis to find our project. You can get the project with:

$ git clone http://vaderserver0.cidse.dhcp.asu.edu:10000/hxwang/topic-analysis.git

You will be asked for your user name and password. You can avoid typing these every time by uploading an SSH key. To find out how, go to the deploy key page on our git server, and make sure to change the remote with:

$ git remote set-url origin ssh://gitlab@vaderserver0.cidse.dhcp.asu.edu:2323/hxwang/topic-analysis.git

Every new member should ask me for developer access in order to make updates to the project. I recommend that everyone create their own branch.

Things to install

In order to get started on the project, there are a few things you will have to install.

nodejs

Our backend uses nodejs. You can install it from https://nodejs.org/.

browserify

Our frontend javascript is bundled using browserify, which is a nodejs library. You can learn more about it at http://browserify.org/.

To install browserify, you can do:

$ npm install -g browserify watchify

Note that npm is the package manager for nodejs; you should be familiar with it as well. This command also installs watchify, a companion tool for browserify that rebuilds the bundle whenever a source file changes. You can learn more about it at https://github.com/substack/watchify.

node-java

Although we primarily use nodejs, we still rely on some java libraries for NLP tasks, such as StanfordCoreNLP and Mallet. To communicate between nodejs and java, we use the node-java library. You should try to get familiar with it. Please use the module in java/java_init.js to load the library and the java code.

You don't have to install node-java manually, because its installation is included in our npm script (covered later). But if you really want to, you can do

$ npm install java

Important For Windows Users

Installing the node-java library can be a bit tricky on Windows. First you need to have node-gyp installed:

$ npm install -g node-gyp

Installing node-gyp requires python and a c++ compiler (usually the one that comes with Visual Studio), so it may throw an error if these are missing. If you don't want to install them manually, I suggest you use Windows-Build-Tools. You can install it with:

$ npm install --global windows-build-tools

It will take a while.

java codes

I have wrapped up all the java code inside /jars/nlptoolkit.jar. This file contains the precompiled java code from my other project, nlptoolkit. It simply wraps some NLP libraries along with some helper functions, and it also contains a slightly modified Mallet library. Ask me for developer access if you think you need to add functionality to this code.

sass

sass is a preprocessor for css. We use it to create a css bundle in public/css/. To install sass, first install Ruby, and then run:

$ gem install sass

bower

bower is another package manager, aimed at front-end libraries. Although it can be used for everything, we only use it to install css libraries for our project. Install bower with:

$ npm install -g bower

Development Environment

Installing Dependencies

javascript

After all the necessary components are installed, you should run

$ npm install

This installs all necessary dependencies for our project. If the installation throws an error, the most likely cause is that the node-java installation failed; please refer to the node-java section for details. If you are not familiar with npm, I suggest you take a look at our package.json file. The dependencies field lists all the node libraries we are using. If you want to install additional libraries, please do

$ npm install --save [library]

Important: the --save option adds the library you installed to the dependencies field. This is very important because it allows others to know which libraries you have used.
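
To check which libraries are already declared, the dependencies field can also be read programmatically. The sketch below is only an illustration: listDependencies is a hypothetical helper (not part of the project), and the sample object stands in for the parsed package.json, with library names and version ranges invented for the example.

```javascript
// Hypothetical helper: returns "name@range" strings for every entry in the
// "dependencies" field of a parsed package.json object, sorted by name.
function listDependencies(pkg) {
  var deps = pkg.dependencies || {};
  return Object.keys(deps).sort().map(function (name) {
    return name + '@' + deps[name];
  });
}

// Stand-in for require('./package.json'); names and versions are made up.
var samplePkg = {
  dependencies: { express: '^4.14.0', co: '^4.6.0' }
};

console.log(listDependencies(samplePkg));
```

A library installed without --save would not show up in this list, which is exactly why others could not tell that you use it.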

css

You also need to install the necessary css libraries. Use the following commands:

$ cd public/
$ bower install

Building and Debugging

After installing all the dependencies, please take a look at the scripts field in package.json. These are our npm scripts. Each can be executed with

$ npm run [command-name]

I will explain them one by one:

"scripts": {
  "build-js": "browserify browser/js/main.js -o public/bundle.js -t [babelify --presets [es2015 async-to-generator async-generator-functions]]",
  "build": "npm run build-js & npm run build-css",
  "watch-js": "watchify browser/js/main.js -o public/bundle.js -dv",
  "watch": "npm run watch-js",
  "start": "node app.js",
  "build-css": "sass browser/css/main.scss:public/css/main.css --style compressed",
  "watch-css": "sass --watch browser/css/main.scss:public/css/main.css --style compressed"
}
  • build-js uses browserify to bundle our front-end javascript codes into /public/bundle.js.
  • build-css uses sass to bundle our css into /public/css/main.css.
  • build builds both javascript and css.
  • watch-js uses watchify to watch for any changes made to the front-end javascript code and update /public/bundle.js.
  • watch-css uses sass to watch for any changes made to the css and updates /public/css/main.css.
  • watch watches the javascript (it simply runs watch-js).
  • start starts the back-end server which listens to port 10082.

Typically, during development, you want to open two terminals: one running npm run watch-js and the other running npm run watch-css. This makes sure all changes you make to the front-end take effect when you refresh the webpage. You also want a third terminal for npm start, and you need to rerun npm start every time you change the backend.

Project Structure

The project may look like it has a lot of folders, but essentially, it has only a few major components:

browser/
  css/
    main.scss
    ...
  js/
    main.js
    ...
public/
  css/
  bundle.js
  bower.json
  ...
routes/
  index.js
  ...
views/
  index.ejs
  ...
app.js
package.json
... /* other modules and resources */
  • app.js is our main module. It initializes the backend server. We use the Express library to build our server.
  • routes/ specifies the routing behavior of the server. Please read https://expressjs.com/en/guide/routing.html for more information about routing.
  • views/ stores the rendering components. We use ejs as our view engine, which has a syntax similar to html.
  • public/ stores the static content for the website, such as the bundled javascript and css files.
  • browser/ is where most of the front-end code is stored. browser/css stores the css code and browser/js stores the javascript code. The main.js file is the entry point of all front-end javascript code.
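
The roles of app.js, routes/, views/ and public/ can be sketched as follows. This is a minimal illustration and not the project's actual app.js: createApp is a hypothetical function, the Express module and the router are passed in as arguments so the wiring is easy to see, and mounting the router at '/' as well as routes/index.js exporting a router are assumptions. Only the port (10082) and the ejs view engine come from this tutorial.

```javascript
// Hypothetical wiring function. `express` is the Express module and
// `indexRouter` is a router such as the one assumed to be exported by
// routes/index.js.
function createApp(express, indexRouter) {
  var app = express();
  // Render the templates in views/ with ejs.
  app.set('views', 'views');
  app.set('view engine', 'ejs');
  // Serve bundle.js, css/ and other static files from public/.
  app.use(express.static('public'));
  // Delegate request handling to the routing module.
  app.use('/', indexRouter);
  return app;
}

// Typical use inside app.js (requires express to be installed):
//   var app = createApp(require('express'), require('./routes/index'));
//   app.listen(10082);
```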

Database

MongoDB is the database that we use. Our MongoDB server is located at vaderserver0.cidse.dhcp.asu.edu:27017/gender_study.
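
The driver expects this location as a connection string. A minimal sketch, assuming no authentication or extra options are needed: mongoUrl is a hypothetical helper, and the project itself obtains its URL through the module in db_mongo/connection, whose implementation may differ.

```javascript
// Hypothetical helper that builds a standard MongoDB connection string
// of the form mongodb://host:port/database.
function mongoUrl(host, port, database) {
  return 'mongodb://' + host + ':' + port + '/' + database;
}

var url = mongoUrl('vaderserver0.cidse.dhcp.asu.edu', 27017, 'gender_study');
console.log(url);
```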

Examine the Data

There are two ways to examine our database: one is to log in to our server using SSH; the other is to install a MongoDB client on your local machine.

SSH

You need an SSH terminal to log in to our server; please ask me for an SSH account. Once you log in, do the following:

$ mongo
> use gender_study

MongoDB Client

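Refer to https://docs.mongodb.com/manual/administration/install-community/ to install the MongoDB Community Edition on your machine. Once you finish, you can use the same command sequence to access the database, except that the mongo shell has to connect to our server rather than to your local machine (for example, mongo vaderserver0.cidse.dhcp.asu.edu:27017/gender_study).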

Using Mongo shell command

You should be familiar with Mongo shell commands to navigate the database. You can start with https://docs.mongodb.com/manual/reference/mongo-shell/. Some of the most common commands you will use are db.collection.find() and db.collection.findOne(), where collection is the name of a collection such as papers.

MongoDB Nodejs Driver

The web server needs a driver to communicate with the database server, and we use the MongoDB Native Driver for this purpose. There are alternatives such as Mongoose, but I opted for the native driver for simplicity. When you search for help online, make sure the solution is for the native driver and not for another library. The driver API is too large to explain in this tutorial; you can find examples of how to use it in db_mongo/ and routes/. You should also refer to the driver documentation for help. We are using version 2.2 right now.

Database Schema

Collections:

> show collections
papers
panels
models
users
sessions
...

papers

Stores information about papers. It has the following schema:

{
  _id : /* ObjectID */,
  id : /* a unique identifier for the paper */,
  title : /* title of the paper in string */,
  year : /* year the paper is published */,
  type : /* type of the paper:
          1 -> academic paper
          2 -> roundtable/workshop paper
          */,
  panel : /* the id of the panel to which this paper belongs */,
  abstract : /* the abstract of the paper in string (Nullable) */,
  title_tokens : [ /* tokenized title in array */
    {
      text : /* original text of the token */,
      index : /* index of the token in the title */,
      sent_index : /* index of the sentence to which the token belongs */,
      index_in_sent : /* index of the token in the sentence */,
      begin_position : /* the character position of the first character of the token */,
      end_position : /* 1 + the end position of the last character of the token */,
      ner : /* named entity of the token */,
      lemma : /* lemma of the token */
    },
    ...
  ],
  abstract_tokens : [ /* tokenized abstract in array (Nullable)*/
    {
      text : /* original text of the token */,
      index : /* index of the token in the abstract */,
      sent_index : /* index of the sentence to which the token belongs */,
      index_in_sent : /* index of the token in the sentence */,
      begin_position : /* the character position of the first character of the token */,
      end_position : /* 1 + the end position of the last character of the token */,
      ner : /* named entity of the token */,
      lemma : /* lemma of the token */
    },
    ...
  ]
}
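
As an illustration of the token fields, the sketch below builds a small paper document by hand and reads it back. All values are invented, and it assumes that index, begin_position and end_position are zero-based, with end_position exclusive as described in the schema; tokenText and lemmas are hypothetical helpers, not project code.

```javascript
// Invented sample document that follows the papers schema.
var paper = {
  id: 1,
  title: 'Gender and Work',
  year: 2015,
  type: 1,          // 1 -> academic paper
  panel: 10,
  abstract: null,   // nullable
  title_tokens: [
    { text: 'Gender', index: 0, sent_index: 0, index_in_sent: 0,
      begin_position: 0, end_position: 6, ner: 'O', lemma: 'gender' },
    { text: 'and', index: 1, sent_index: 0, index_in_sent: 1,
      begin_position: 7, end_position: 10, ner: 'O', lemma: 'and' },
    { text: 'Work', index: 2, sent_index: 0, index_in_sent: 2,
      begin_position: 11, end_position: 15, ner: 'O', lemma: 'work' }
  ],
  abstract_tokens: null  // nullable, like the abstract itself
};

// Recovers a token's surface text from the string it was extracted from.
// Works directly with slice() because end_position is exclusive.
function tokenText(source, token) {
  return source.slice(token.begin_position, token.end_position);
}

// Collects the lemmas of a token array; tolerates the nullable fields.
function lemmas(tokens) {
  return (tokens || []).map(function (t) { return t.lemma; });
}

console.log(tokenText(paper.title, paper.title_tokens[2]));
console.log(lemmas(paper.title_tokens));
```
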

panels

{
  _id : /* ObjectID */,
  id : /* a unique identifier for the panel */,
  title : /* title of the panel in string */,
  year : /* year of the panel */,
  type : /* type of the panel:
          1 -> academic panel
          2 -> roundtable/workshop panel
          */,
  papers : [ /* array of the ids of the papers that belong to the panel */
    ...
  ],
  abstract : /* the abstract of the panel in string (Nullable) */,
  title_tokens : [ /* tokenized title in array */
    {
      text : /* original text of the token */,
      index : /* index of the token in the title */,
      sent_index : /* index of the sentence to which the token belongs */,
      index_in_sent : /* index of the token in the sentence */,
      begin_position : /* the character position of the first character of the token */,
      end_position : /* 1 + the end position of the last character of the token */,
      ner : /* named entity of the token */,
      lemma : /* lemma of the token */
    },
    ...
  ],
  abstract_tokens : [ /* tokenized abstract in array (Nullable) */
    {
      text : /* original text of the token */,
      index : /* index of the token in the abstract */,
      sent_index : /* index of the sentence to which the token belongs */,
      index_in_sent : /* index of the token in the sentence */,
      begin_position : /* the character position of the first character of the token */,
      end_position : /* 1 + the end position of the last character of the token */,
      ner : /* named entity of the token */,
      lemma : /* lemma of the token */
    },
    ...
  ]
}
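
Papers and panels reference each other: a paper's panel field holds a panel id, and a panel's papers array holds paper ids. Below is a minimal in-memory sketch of following that link, using invented documents. papersOfPanel is a hypothetical helper; against the real database the same lookup would be a query on the papers collection.

```javascript
// Invented sample documents; only the fields needed for the join are shown.
var panel = { id: 10, title: 'Labor and Identity', year: 2015, type: 1, papers: [1, 2] };
var papers = [
  { id: 1, title: 'Gender and Work', panel: 10 },
  { id: 2, title: 'Care Economies', panel: 10 },
  { id: 3, title: 'Archives of Feeling', panel: 11 }
];

// Returns the paper documents whose id appears in panel.papers.
// Rough Mongo shell equivalent: db.papers.find({ id: { $in: panel.papers } })
function papersOfPanel(panel, allPapers) {
  return allPapers.filter(function (p) {
    return panel.papers.indexOf(p.id) !== -1;
  });
}

var titles = papersOfPanel(panel, papers).map(function (p) { return p.title; });
console.log(titles);
```
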

models

Stores the topic models

{
  _id : /* ObjectId */,
  id : /* unique identifier for the model */,
  name : /* name of the model */,
  year : /* year of the data for which the model was trained */,
  type : 1 /* type of the data for which the model was trained
              1 -> academic panel
              2 -> roundtable/workshop panel
            */,
  level : 3 /* specifies if the data is paper or panel
              3 -> paper
              4 -> panel
             */,
  field : 5 /* specifies if the model is trained on title or abstract
              5 -> title
              6 -> abstract
             */,
  num_topics : 10 /* number of topics the model has */,
  num_iterations : 2000 /* number of iterations during training of the model */,
  model : /* the serialized java object of the class nlp.edu.asu.vader.mallet.model.TopicModel, which wraps up the Mallet topic model along with some helper functions in binary format
           */
}
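
The numeric codes in type, level and field are easy to mix up, so a small lookup can help when inspecting model documents. This is a hedged sketch: the mappings are copied from the schema comments above, while describeModel and the sample values are invented for illustration.

```javascript
// Code tables taken from the schema comments of the models collection.
var TYPE = { 1: 'academic', 2: 'roundtable/workshop' };
var LEVEL = { 3: 'paper', 4: 'panel' };
var FIELD = { 5: 'title', 6: 'abstract' };

// Hypothetical helper: produces a readable one-line summary of a model
// document (the binary `model` field is ignored).
function describeModel(m) {
  return [
    m.year,
    TYPE[m.type],
    LEVEL[m.level],
    FIELD[m.field],
    m.num_topics + ' topics'
  ].join(', ');
}

console.log(describeModel({ year: 2015, type: 1, level: 3, field: 5, num_topics: 10 }));
```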

Example of how to load the model from the database, from routes/router_load_topic_model.js:

var TopicModel = require('../mallet/topic_model');
var ConnStat = require('../db_mongo/connection');
var model_col = require('../db_mongo/model_col');
var co = require('co');
var mongodb = require('mongodb');
var MongoClient = mongodb.MongoClient;

module.exports = exports = function(req, res){
  var id = Number(req.query.id);
  co(function*(){
    /* Connects to the mongodb database */
    var db = yield MongoClient.connect(ConnStat().url());
    /* point to the models collection */
    var col = db.collection(model_col);
    /* query for the model based on id */
    var data_array = yield col.find({id : id}).toArray();
    db.close();
    var m = data_array[0];
    /* deserialize the model from the binary */
    var topic_model = TopicModel().load_from_binary(m.model.buffer);
    /* return the json formated representation for the model */
    return Promise.resolve(topic_model.get_topics_with_id(20));
  }).then(function(json){
    /* send the json formated model to the front-end */
    res.json(json);
  }).catch(function(err){
    console.log(err);
    res.status(500);
    res.send(err);
  });
};

users

Stores the information for each user

{
  _id : /* ObjectID, this will be used as the unique identifier for user */,
  local : {
  	email : /* email address of the user, also serves as user name */,
  	password : /* password hashed */
  }
}

sessions

Stores users' login sessions. This is handled automatically by the passport library, so generally you won't have to worry about it.

Other Important Libraries and Utilities

d3

d3 is now perhaps "THE" library for visualization on the web. We use it throughout the project. If you are not familiar with it, please spend some time becoming proficient. You can start with some online tutorials, and the API Reference is always your friend. Beware that d3 has two incompatible versions: v3 and v4. We only use v4, but it helps to know both. The similarities between the two versions are much greater than their differences. You can check Changes in D3 4.0 and What Makes Software Good? to learn more.

jquery

jquery is the ubiquitous front-end javascript library that does almost everything related to the DOM. It is so popular that everyone expects you to know it, and we use it quite a lot in the project.

AJAX

We mostly use jquery for AJAX, but you are not required to; you can use d3-request or even plain javascript if you like. If you want to use jquery, check jQuery.ajax() for help. Also, browser/js/load/ should give you enough information about how we do AJAX in our project.
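
One common way to combine jquery AJAX with the rest of an asynchronous code base is to wrap the call in a Promise. The sketch below is an illustration rather than the code in browser/js/load/: loadJson is a hypothetical helper, the endpoint in the usage comment is invented, and jQuery is passed in as an argument so the helper does not rely on a global.

```javascript
// Hypothetical helper: performs a JSON request through jQuery.ajax and
// exposes the outcome as a Promise.
function loadJson($, url, params) {
  return new Promise(function (resolve, reject) {
    $.ajax({
      url: url,
      data: params,       // sent as query parameters for a GET request
      dataType: 'json',
      success: function (data) { resolve(data); },
      error: function (xhr, status, err) {
        reject(new Error(status + ': ' + err));
      }
    });
  });
}

// Usage in the browser (the endpoint name is made up):
//   loadJson(jQuery, '/some-endpoint', { id: 1 })
//     .then(function (json) { console.log(json); })
//     .catch(function (err) { console.log(err); });
```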

Promise

Most asynchronous components in our code are wrapped in Promises. Promise has been natively supported by javascript since ES6. It makes asynchronous programming easier (avoiding the Callback Hell). Promises are easy to use, but if you are new to asynchronous programming, it will take a while to grasp the concepts. We use Promise throughout the project.
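
A small sketch of the idea, independent of the project code: a callback-style function is wrapped once in a Promise, after which sequential steps chain with then and errors are handled in one place. delayedDouble is a made-up stand-in for any asynchronous operation.

```javascript
// Callback style: the result is delivered later through cb(err, value).
function delayedDouble(x, cb) {
  setTimeout(function () {
    if (typeof x !== 'number') {
      cb(new Error('not a number'));
      return;
    }
    cb(null, x * 2);
  }, 10);
}

// Promise wrapper around the callback-style function.
function doubleAsync(x) {
  return new Promise(function (resolve, reject) {
    delayedDouble(x, function (err, value) {
      if (err) { reject(err); } else { resolve(value); }
    });
  });
}

// Two sequential asynchronous steps without nesting callbacks.
doubleAsync(2)
  .then(function (v) { return doubleAsync(v); })
  .then(function (v) { console.log('result:', v); })
  .catch(function (err) { console.log(err); });
```

With plain callbacks, the second step would have to be nested inside the callback of the first, and each level would need its own error check.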

generator function

Note that in many places we use generator functions and co to simplify our Promise code. For example, this code is from browser/js/control/controller_topic_model_selection.js:

co(function*(){
  $(global.topic_viewer.loading()).show();
  var topics = yield LoadTopicModel().id(selected_model.id).load();
  $(global.topic_viewer.loading()).hide();
  yield global.topic_viewer.display_opt('weight').data(topics).update();
  var data = yield global.topic_document_viewer.year(selected_model.year)
    .type(selected_model.type).level(selected_model.level).load();
  global.topic_document_viewer.data(data).update();
}).catch(function(err){
  console.log(err);
  $(global.topic_viewer.loading()).hide();
});

These are handy, so it doesn't hurt to know a little about them. Also, generator functions are currently only supported by Chrome and Firefox, which is why our project doesn't work in IE right now.
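
To show what happens to the yielded Promises, here is a deliberately simplified runner. It is not co's actual implementation (co supports more kinds of yielded values than plain Promises); run and fakeLoad are names invented for this sketch.

```javascript
// Simplified runner: drives a generator, waiting for each yielded Promise
// and feeding its resolved value back into the generator.
function run(genFn) {
  return new Promise(function (resolve, reject) {
    var gen = genFn();
    function step(method, arg) {
      var r;
      try {
        r = gen[method](arg);          // gen.next(value) or gen.throw(error)
      } catch (e) {
        reject(e);                     // uncaught error inside the generator
        return;
      }
      if (r.done) {
        resolve(r.value);              // generator returned
        return;
      }
      Promise.resolve(r.value).then(
        function (v) { step('next', v); },
        function (e) { step('throw', e); }
      );
    }
    step('next');
  });
}

// Stand-in for an asynchronous loader such as an AJAX call.
function fakeLoad(value) {
  return new Promise(function (resolve) {
    setTimeout(function () { resolve(value); }, 10);
  });
}

var done = run(function* () {
  var topics = yield fakeLoad(['topic 0', 'topic 1']);
  var docs = yield fakeLoad(topics.length * 10);
  return { topics: topics, docs: docs };
});
done.then(function (r) { console.log(r); });
```

Each yield pauses the generator until the Promise settles, which is why the body reads like synchronous code even though every load is asynchronous.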

passport

The passport library is used to handle user logins. Our user model is located at auth/user.js and the login and signup handlers are in auth/passport.js.

A little about MVC and Organization of Code

MVC stands for model-view-controller. According to this wikipedia article, it is a software architectural pattern for implementing user interfaces on computers. It divides software into three components: model, view and controller. In our application, you can think of the model as the data, and the views as our visualization components, such as "topic view", "document view", "timeline view", etc. The controllers are responsible for updating the models and views. Models and views can have helper functions that wrap up commonly used functionality.

Our views are mostly located in browser/js/view/, our controllers are in browser/js/control, and the models are mostly in the backend. I have also separated the AJAX loading and UI components from the MVC; they are in browser/js/load/ and browser/js/UI.

This pattern is not rigidly enforced in our project, and I opted not to use any framework (angularjs, reactjs, etc.) for flexibility. It is up to the developers to decide how best to organize their code, but please be aware of our existing code structure and the MVC. The bottom line is that you should always modularize your code. If you are a less experienced javascript programmer, please spend some time learning about object oriented programming in javascript, or read some of the code in our project for inspiration. If you are an experienced programmer, I will be very grateful if you can help me improve the code organization.

However, I would advise against spending so much time on code organization that it hinders your ability to get work done on time. It is bad to write messy code (although often acceptable for research code), but it is worse to spend too much time on organization and miss the deadline (which is not acceptable). So our priority is to get things done, and it's better if you can be neat too.
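
The chained calls in the controller code shown earlier, such as display_opt('weight').data(topics).update(), follow a getter/setter closure style that is common in d3 code. The sketch below shows one way such a view module could be organized. TopicList is hypothetical, and it renders into an array of strings instead of the DOM purely to keep the example self-contained.

```javascript
// Hypothetical view module in the chainable getter/setter style.
function TopicList() {
  // Private state, only reachable through the accessors below.
  var data = [];
  var display_opt = 'weight';
  var rendered = [];

  var view = {};
  // Called with an argument: set the value and return the view for chaining.
  // Called without arguments: return the current value.
  view.data = function (_) {
    return arguments.length ? (data = _, view) : data;
  };
  view.display_opt = function (_) {
    return arguments.length ? (display_opt = _, view) : display_opt;
  };
  view.rendered = function () { return rendered; };
  // Re-renders from the current state; returns a Promise so that callers
  // can wait for the update to finish.
  view.update = function () {
    rendered = data.map(function (d) {
      return d.name + ' (' + display_opt + ': ' + d[display_opt] + ')';
    });
    return Promise.resolve(view);
  };
  return view;
}

var viewer = TopicList();
viewer.display_opt('weight')
  .data([{ name: 'topic 0', weight: 0.4 }])
  .update()
  .then(function (v) { console.log(v.rendered()); });
```

A controller then only needs to set the data and call update(), which keeps the rendering details inside the view module.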
