KOSHIK

KOSHIK is an NLP framework for large-scale text processing on Hadoop. It supports parsing text in multiple languages, including English, Swedish, and Chinese.

USAGE

Before a corpus can be processed, it must be imported into KOSHIK. KOSHIK supports import from plain text, CoNLL2006/2009, and Wikipedia XML dumps. To import from a Wikipedia XML dump file, run:

hadoop jar Koshik-1.0.1.jar se.lth.cs.koshik.util.Import -input /enwiki-20140102-pages-articles.xml -inputformat wikipedia -language eng -charset utf-8 -output /enwiki_avro

The imported documents can then be parsed using the analysis tools in KOSHIK. To parse them with the English semantic role labeler, run:

hadoop jar Koshik-1.0.1.jar se.lth.cs.koshik.util.EnglishPipeline -D mapred.reduce.tasks=12 -D mapred.child.java.opts=-Xmx8G -archives model.zip -input /enwiki_avro -output /enwiki_semantic

Querying data through Hive

  • Importing data into Hive:

    CREATE EXTERNAL TABLE koshikdocs
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS
        INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      LOCATION '/hivetablekoshik'
      TBLPROPERTIES ('avro.schema.url'='hdfs:///AvroDocument.avsc');
    LOAD DATA INPATH '/enwiki_semantic/*.avro' INTO TABLE koshikdocs;

  • Number of articles:

    SELECT count(identifier) FROM koshikdocs;

  • Number of sentences:

    SELECT count(ann) FROM koshikdocs LATERAL VIEW explode(annotations.layer) annTable as ann WHERE ann LIKE '%Sentence';

  • Number of tokens:

    SELECT count(ann) FROM koshikdocs LATERAL VIEW explode(annotations.layer) annTable as ann WHERE ann LIKE '%Token';

  • Number of nouns:

    SELECT count(key)
    FROM (
      SELECT explode(ann) AS (key, value)
      FROM (
        SELECT ann
        FROM koshikdocs
        LATERAL VIEW explode(annotations.features) annTable AS ann
      ) annmap
    ) decmap
    WHERE key = 'POSTAG' AND value LIKE 'NN%';
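
To make the logic of the counting queries easier to follow, here is a small Python sketch that mirrors them on in-memory data. It assumes, based only on how the queries access `annotations.layer` and `annotations.features`, that each document has a list of annotations, each with a `layer` string and a `features` map. The sample documents, the `example.` layer names, and the function names are invented for illustration and are not taken from the actual Avro schema.

```python
# Toy stand-in for rows of the koshikdocs table.
# ASSUMPTION: the record shape is inferred from the Hive queries,
# not from AvroDocument.avsc; layer names here are made up.
docs = [
    {"identifier": "doc-1", "annotations": [
        {"layer": "example.Sentence", "features": {}},
        {"layer": "example.Token", "features": {"POSTAG": "NNP"}},
        {"layer": "example.Token", "features": {"POSTAG": "VBZ"}},
        {"layer": "example.Token", "features": {"POSTAG": "NN"}},
    ]},
    {"identifier": "doc-2", "annotations": [
        {"layer": "example.Sentence", "features": {}},
        {"layer": "example.Token", "features": {"POSTAG": "DT"}},
        {"layer": "example.Token", "features": {"POSTAG": "NNS"}},
    ]},
]


def count_articles(docs):
    # Mirrors count(identifier): counts rows with a non-null identifier.
    return sum(1 for d in docs if d.get("identifier") is not None)


def count_layer(docs, suffix):
    # Mirrors explode(annotations.layer) with WHERE ann LIKE '%<suffix>'.
    return sum(
        1
        for d in docs
        for a in d["annotations"]
        if a["layer"].endswith(suffix)
    )


def count_nouns(docs):
    # Mirrors exploding every features map into (key, value) pairs and
    # keeping those with key = 'POSTAG' and value LIKE 'NN%'.
    return sum(
        1
        for d in docs
        for a in d["annotations"]
        for key, value in a["features"].items()
        if key == "POSTAG" and value.startswith("NN")
    )


if __name__ == "__main__":
    print("articles:", count_articles(docs))
    print("sentences:", count_layer(docs, "Sentence"))
    print("tokens:", count_layer(docs, "Token"))
    print("nouns:", count_nouns(docs))
```

The `LIKE 'NN%'` prefix match is what lets a single query cover all noun tags that start with `NN`, such as `NN`, `NNS`, and `NNP`.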

NLP Model files

The language model files for the tools used in KOSHIK can be downloaded from the following sites:

References

If you use KOSHIK, please cite the following paper:
