Event Data Processing Pipeline
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
corenlp
petrach name the folder into lower case Jan 15, 2016
queue
test
utilities
README.md
cook.sh
date_formatter.py
gigaword_loader.py
producer.py
sqlite_mongo_bulk.py

README.md

biryani

Installing

Step 1: Install docker.

Step 2: Install python.

Step 3: Install RabbitMQ

Step 4: Choosing Brach

master brach is the ligth weight version of biryani which has only specified annotators(annotators, tokenize, ssplit, pos, parse)

If you want to run biryani will all annotators refer to brach All-Annotators

If you want to run biryani will all annotators and also dynamically decide how many documents to be processed using kalman filter refer to brach kalman_filter_all_anno

Step 5: Sending documents to the RabbitMQ queue

Download the local copy of the biryani repo. Open the file producer.py and make the necessary changes according to how your RabbitMQ server is setup

Once the changes to the producer.py are complete run the file using the following command python producer.py

Step 6: Making changes to corenlp.json and log4j.properties files

Go to biryani/corenlp/ folder, you can find corenlp.json and log4j.properties files

corenlp.json file contains the RabbitMq server configuration information and queue name in which the documents are present
Make sure you make neccessary changes to the corenlp.json file according to how you setup your RabbitMQ server and Queue name

log4j.xml file containes the logging configuration details. Make the necessary changes for the ip address and port you want to use for logging.

<Socket name="socket" host="logstash server host" port="5000">

Step 7: docker-elk
Download the docker-elk repo from
https://github.com/deviantony/docker-elk

Go to the directory docker-elk/logstash/config
you will find logstash.conf

add the following code below the tcp in logstash.conf file

log4j 
{
  port => the port number you added in log4j.xml file
}

Note: Make sure that you add the port number in the docker-compose.yml file of the root directory.
you can find ports section in the file, just add the port you added in logstash.conf here.

Step 8:

Go to the folder corenlp and run the following command.

docker build -t image-name .

Note: There is a period after image-name, which specifies that Docker file is in current directory.

Example: docker build -t phani\ccnlp:1.0 .

Step 9: To run the image created

docker run image-name java -cp ".:"lib/*" corenlp_worker #threads #documents(batch size) #Log_token #Database Name

Note: Be careful with the image name you give. If the image is not present, docker searches for the image in the dockerhub and if there an image it would download the image and run the for you.

Example: docker run phani\ccnlp:1.0 java -cp ".:"lib/*" corenlp_worker 16 200 logging test_database

Step 10: Install Petrarch2

Install petrarch2 by using the following command.

pip install git+https://github.com/openeventdata/petrarch2.git

Step 11: Extracting phrases from corenlp parsed tree and storing them in mongodb

Once the container has parsed all the documents copy the database file to biryani/utilities/ directory
In the directory you can find getPhrases_threads.py. Run the following command

python getPhrases_threads.py corenlp_databasefile.db # documents to be processed per batch  #threads

Example

python getPhrases_threads.py test_database.db 5000 16

The extarcted phrases are stored in test_database_petrarch.db