-
Notifications
You must be signed in to change notification settings - Fork 0
A program to aggregate, rank, and search text information
License
neubig/webigator
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
** webigator ** by Graham Neubig ** neubig at gmail dot com Webigator is a program to extract useful information from large amounts of text. In particular, it was developed considering the information needs after disasters, such as: * Locations at which people can evacuate, get water or supplies * Text about the safety of people in the disaster-affected areas * Requests for rescue or disaster supplies At the moment, we are mainly focusing on filtering this information from Twitter. -------------- Install ------------------- The latest version of webigator can be found here: http://www.phontron.com/webigator http://www.github.com/neubig/webigator under the Eclipse Common License. webigator works on linux, and will likely work on Mac or Cygwin. It is dependent on a number of libraries including Boost and XML-RPC, which can be downloaded as follows on Ubuntu: $ sudo apt-get install libboost-all-dev libxmlrpc-c3-dev In addition, for the Perl interface you will want XML::RPC: $ cpan XML::RPC Next, we build the server $ autoreconf -i $ ./configure $ make If this works properly, you should see a help message as follows: src/bin/webigator --help If this doesn't work, please contact the authors. ---------- Setting up the Web Interface ---------- Webigator runs best as a web interface, so we will want to install this too. First, you will need to locate two directories. Your web server home directory, and your cgi-bin directory. If you are running Apache on Ubuntu with the default settings, these are /var/www and /usr/lib/cgi-bin respectively. So, to set up the web server, you will run the following commands: Create the webigator-demo directory: $ mkdir /var/www/webigator-demo Move all of the images, css files, etc to this directory, and all of the program files to the cgi-bin directory: $ cp -r script/cgi/{css,img,js} /var/www/webigator-demo/ $ cp script/cgi/*.{pl,cgi,tpl,pm} /usr/lib/cgi-bin If this works, you will be able to access the webigator tool at: http://localhost/cgi-bin/webigator-demo.cgi (Or replace "localhost" with a different name if you are running on a server.) There are also number of settings in /usr/lib/cgi-bin/settings.pl that you can change, such as the UI language, tokenization method (the default is by characters for Japanese), etc. --------------- Using the tool ------------------- Once install has finished, run the server: $ src/bin/webigator Next, we can create a task by going to the web interface and clicking "create task." We start a program that streams data to the webigator server. This should eventually do something like gather data from the Twitter stream, but for the time being, let's assume that you have some data in tab-separated format in the file data.csv: $ script/data-streamer.pl -data data.csv Next, we access the web interface again, and start labeling the data for the task. -------------------- API ------------------------- The perl programs and server interact through an XML API: ***** Add a keyword add_keyword: id = Data ID (int) text = Data (string) lab = Is the keyword for useful info or not useful info (1 or 0) ***** Add an unlabeled example to be scored add_unlabeled: id = Data ID (int) text = Data (string) ***** Add an instance that has been labeled add_labeled: id = Data ID (int) text = Data (string) lab = Is the data for useful info or not useful info (1 or 0) ***** Return the highest scoring value from the server pop_best: ***** Rescore all of the values on the server: rescore: --------------- TODO ------------------ * Improve the scoring function so it doesn't over-value short text at the beginning * Create a program to stream tweets directly from Twitter
About
A program to aggregate, rank, and search text information
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published