neggert/KaggleFB

Data processing using Hadoop and OpenNLP for a Facebook-sponsored Kaggle competition. The basic idea is to extract tags from the text of Stack Overflow questions.

The jar file that is given to Hadoop is produced by mvn package, which creates target/KaggleFB-1.0-SNAPSHOT-job.jar.

Sentence extraction

The data processing happens in two steps. First, we parse the CSV file and extract individual sentences from the body and title fields using the OpenNLP sentence extractor. This uses a custom input format based on the opencsv library. Unfortunately, this step cannot be done in parallel because CSV files with multi-line fields can't be split by Hadoop (or at least, I can't figure out a way).
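To illustrate why line-based splitting breaks here, the sketch below counts records in a tiny CSV whose body field contains a newline. It is not the custom input format from this repository (that one is built on opencsv); it is a hand-rolled scanner that only understands double-quote quoting, and the two sample rows are invented. Whether a newline ends a record depends on the quote state built up from the start of the file, which a split beginning at an arbitrary byte offset does not have.

```java
import java.util.ArrayList;
import java.util.List;

final class CsvRecordBoundaries {

    /** Splits CSV text into records; newlines inside double quotes stay in the field. */
    static List<String> splitRecords(String csv) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // A doubled quote ("") toggles twice, so the state ends up unchanged.
                inQuotes = !inQuotes;
            }
            if (c == '\n' && !inQuotes) {
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        // Invented sample rows; the first body field spans two physical lines.
        String csv = "\"1\",\"Title one\",\"First line.\nSecond line.\",\"java\"\n"
                   + "\"2\",\"Title two\",\"Single line body.\",\"python\"\n";
        System.out.println("physical lines: " + csv.split("\n").length);
        System.out.println("CSV records:    " + splitRecords(csv).size());
    }
}
```

A reader that starts at the second physical line would see `Second line.","java"` and have no way to tell that it is the tail of a record rather than a malformed new one.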

The sentence extraction step is run like so:

hadoop jar KaggleFB-1.0-SNAPSHOT-job.jar KaggleFB.SentExtractMR KaggleFB/Train.csv KaggleFB/sent_output

This will produce a dataset in the sent_output directory with the format

id  source  content tags

where the fields are defined as

  • id - original post id, as defined in Train.csv
  • source - either body or title depending on whether the sentence came from a post body or title
  • content - the sentence
  • tags - the tags associated with the post this sentence came from
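As a rough sketch of how a downstream consumer might read one line of this output, the class below parses the four fields. It assumes the fields are tab-separated and that tags are separated by whitespace; neither detail is stated above, and `SentenceRecord` is a hypothetical name, not a class from this repository.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SentenceRecord {
    final String id;
    final String source;   // "body" or "title"
    final String content;  // one sentence
    final List<String> tags;

    private SentenceRecord(String id, String source, String content, List<String> tags) {
        this.id = id;
        this.source = source;
        this.content = content;
        this.tags = tags;
    }

    /** Parses one output line. Assumes tab-separated fields (not confirmed by the README). */
    static SentenceRecord parse(String line) {
        String[] f = line.split("\t", -1); // -1 keeps trailing empty fields
        if (f.length != 4) {
            throw new IllegalArgumentException("expected 4 fields, got " + f.length);
        }
        if (!f[1].equals("body") && !f[1].equals("title")) {
            throw new IllegalArgumentException("unexpected source: " + f[1]);
        }
        String rawTags = f[3].trim();
        // Assumes whitespace-separated tags.
        List<String> tags = rawTags.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(rawTags.split("\\s+"));
        return new SentenceRecord(f[0], f[1], f[2], tags);
    }

    public static void main(String[] args) {
        // Invented example line.
        SentenceRecord r = parse("42\ttitle\tHow do I parse CSV in Java?\tjava csv");
        System.out.println(r.id + " | " + r.source + " | " + r.content + " | " + r.tags);
    }
}
```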

Feature extraction

The feature extraction step is run on the output of the sentence extraction step. The code is run like so:

hadoop jar KaggleFB-1.0-SNAPSHOT-job.jar KaggleFB.FeatExtractMR KaggleFB/sent_output/ KaggleFB/feat_output

This tokenizes the sentences into words using OpenNLP. It then extracts a set of features and a target for each word. The targets can be one of

  • NOTAG (0) - the word is not part of a tag
  • TAGSTART (1) - the word is the first word of a multi-word tag, or a single-word tag
  • TAGMID (2) - the word is part of a multi-word tag, but not the first word

Additionally, there is a special value for the previous target feature when looking at the first word of a sentence:

  • START (3)
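The target scheme above can be sketched as follows. The matching rule is an assumption made for illustration only: a multi-word tag is taken to be hyphen-separated (as in `visual-studio`), words are compared case-insensitively, and the longest matching tag wins. The README does not say how the real code matches tags, and `TargetLabeler` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class TargetLabeler {
    static final int NOTAG = 0;
    static final int TAGSTART = 1;
    static final int TAGMID = 2;

    /** Assigns one target per token. Assumes hyphen-separated, case-insensitive tags. */
    static int[] label(List<String> tokens, List<String> tags) {
        int[] targets = new int[tokens.size()]; // defaults to NOTAG (0)
        int i = 0;
        while (i < tokens.size()) {
            int best = 0; // length in words of the longest tag matching at position i
            for (String tag : tags) {
                String[] parts = tag.toLowerCase(Locale.ROOT).split("-");
                if (parts.length > best && matchesAt(tokens, i, parts)) {
                    best = parts.length;
                }
            }
            if (best == 0) {
                i++;
                continue;
            }
            targets[i] = TAGSTART;
            for (int k = 1; k < best; k++) {
                targets[i + k] = TAGMID;
            }
            i += best;
        }
        return targets;
    }

    private static boolean matchesAt(List<String> tokens, int start, String[] parts) {
        if (start + parts.length > tokens.size()) {
            return false;
        }
        for (int k = 0; k < parts.length; k++) {
            if (!tokens.get(start + k).toLowerCase(Locale.ROOT).equals(parts[k])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("I", "use", "Visual", "Studio", "with", "Java");
        List<String> tags = Arrays.asList("visual-studio", "java");
        System.out.println(Arrays.toString(label(tokens, tags)));
        // prints [0, 0, 1, 2, 0, 1]
    }
}
```

START (3) does not appear here because it is never a target; it only fills the previous-target feature for the first word of a sentence.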

The features are subject to change as I work on this, but they are currently:

  • previous word
  • next word
  • previous target
  • word
  • is word capitalized?

The output file is in the format:

source  tags    prevWord    nextWord    prevTarget  word    cap?    target
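A minimal sketch of how one row per word could be assembled in the column order above. Several details are assumptions not taken from the README: the tab separator, the `<none>` placeholder for a missing previous or next word, "capitalized" meaning an uppercase first character, and `true`/`false` as the encoding of the cap? column. `FeatureRows` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FeatureRows {
    static final int START = 3;             // previous-target value for the first word
    static final String NO_WORD = "<none>"; // hypothetical placeholder at sentence edges

    /** Builds one tab-separated row per token, in the README's column order. */
    static List<String> rows(String source, String tags, List<String> tokens, int[] targets) {
        if (tokens.size() != targets.length) {
            throw new IllegalArgumentException("tokens and targets differ in length");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String word = tokens.get(i);
            String prevWord = (i == 0) ? NO_WORD : tokens.get(i - 1);
            String nextWord = (i == tokens.size() - 1) ? NO_WORD : tokens.get(i + 1);
            int prevTarget = (i == 0) ? START : targets[i - 1];
            // Assumption: "capitalized" means the first character is uppercase.
            boolean cap = !word.isEmpty() && Character.isUpperCase(word.charAt(0));
            out.add(String.join("\t",
                    source, tags, prevWord, nextWord,
                    Integer.toString(prevTarget), word,
                    Boolean.toString(cap), Integer.toString(targets[i])));
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented example: "Java" is a single-word tag, so its target is TAGSTART (1).
        List<String> tokens = Arrays.asList("Use", "Java");
        for (String row : rows("title", "java", tokens, new int[] {0, 1})) {
            System.out.println(row);
        }
    }
}
```

Here the previous-target column comes from the known labels, which is what is available when building training data.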

The plan is to train a MaxEnt model using Mahout that will determine whether or not a given word is tag-like.
