GitHub - neggert/KaggleFB: Data processing for Stack Overflow tag extraction Kaggle competition

Data processing using Hadoop and OpenNLP for a Facebook-sponsored Kaggle competition. The basic idea is to extract tags from the text of Stack Overflow questions.

The jar file that is given to Hadoop is produced by mvn package, which creates target/KaggleFB-1.0-SNAPSHOT-job.jar.

Sentence extraction

The data processing happens in two steps. First, we parse the CSV file and extract individual sentences from the body and title fields using the OpenNLP sentence extractor. This uses a custom input format based on the opencsv library. Unfortunately, this step cannot be done in parallel because CSV files with multi-line fields can't be split by Hadoop (or at least, I can't figure out a way).
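To illustrate why line-based splitting breaks on this kind of file, here is a minimal, standard-library-only sketch. It is not the repository's opencsv-based input format; the class name CsvRecordBoundaries and the sample data are hypothetical, and it assumes fields are quoted with double quotes and that embedded quotes are doubled. The point is that a newline only ends a record when the reader knows it is outside a quoted field, which a reader starting in the middle of a file cannot know.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not code from this repository.
final class CsvRecordBoundaries {

    /** Splits CSV text into records, keeping newlines that fall inside double-quoted fields. */
    static List<String> splitRecords(String csv) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // A doubled quote ("") toggles twice, so the state ends up unchanged.
                inQuotes = !inQuotes;
            }
            if (c == '\n' && !inQuotes) {
                // Only an unquoted newline terminates a record.
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "1,\"Title\",\"first line\nsecond line\"\n"
                   + "2,\"Other\",\"single line\"\n";
        System.out.println("physical lines: " + csv.split("\n").length);
        System.out.println("records: " + splitRecords(csv).size());
    }
}
```

In the sample, the first record spans two physical lines, so the number of lines and the number of records disagree. The quote state has to be carried forward from the start of the file, which is what makes an arbitrary byte offset an unsafe place to begin reading.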

The sentence extraction step is run like so:

hadoop jar KaggleFB-1.0-SNAPSHOT-job.jar KaggleFB.SentExtractMR KaggleFB/Train.csv KaggleFB/sent_output

This will produce a dataset in the sent_output directory with the format

id  source  content tags

where the fields are defined as

id - original post id, as defined in Train.csv
source - either body or title depending on whether the sentence came from a post body or title
content - the sentence
tags - the tags associated with the post this sentence came from
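For downstream code, a line of this output could be read back into a small value object. The sketch below is an assumption-heavy illustration: the class SentenceRecord is hypothetical, and it assumes the four fields are separated by tab characters and that the tags field is a whitespace-separated list. Neither delimiter is stated above, so adjust both to match the actual files.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical holder for one line of the sentence-extraction output.
final class SentenceRecord {
    final String id;          // original post id from Train.csv
    final String source;      // "body" or "title"
    final String content;     // the sentence text
    final List<String> tags;  // tags of the post the sentence came from

    SentenceRecord(String id, String source, String content, List<String> tags) {
        this.id = id;
        this.source = source;
        this.content = content;
        this.tags = tags;
    }

    /** Parses one output line; assumes tab-separated fields and whitespace-separated tags. */
    static SentenceRecord parse(String line) {
        String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
        if (fields.length != 4) {
            throw new IllegalArgumentException("expected 4 fields, got " + fields.length);
        }
        String tagField = fields[3].trim();
        List<String> tags = tagField.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(tagField.split("\\s+"));
        return new SentenceRecord(fields[0], fields[1], fields[2], tags);
    }

    public static void main(String[] args) {
        SentenceRecord r = parse("42\ttitle\tHow do I parse CSV in Java?\tjava csv");
        System.out.println(r.id + " / " + r.source + " / " + r.tags);
    }
}
```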

Feature extraction

The feature extraction step is run on the output of the sentence extraction step. The code is run like so:

hadoop jar KaggleFB-1.0-SNAPSHOT-job.jar KaggleFB.FeatExtractMR KaggleFB/sent_output/ KaggleFB/feat_output

This tokenizes the sentences into words using OpenNLP. It then extracts a set of features and a target for each word. Each target is one of:

NOTAG (0) - the word is not part of a tag
TAGSTART (1) - the word is the first word of a multi-word tag, or a single-word tag
TAGMID (2) - the word is part of a multi-word tag, but not the first word

Additionally, there is a special value for the previous target feature when looking at the first word of a sentence:

START (3)
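The labelling scheme above can be sketched roughly as follows. This is not the code in FeatExtractMR; the class TargetLabeler and its methods are hypothetical. It assumes that a multi-word tag is a hyphenated tag such as sql-server whose parts appear as consecutive words, and that matching is case-insensitive on whole tokens. The real matching rules may well differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of the NOTAG / TAGSTART / TAGMID / START scheme.
final class TargetLabeler {
    static final int NOTAG = 0;
    static final int TAGSTART = 1;
    static final int TAGMID = 2;
    static final int START = 3; // only used as the "previous target" of the first word

    /** Returns one target per token; assumes hyphens separate the words of a tag. */
    static int[] label(List<String> tokens, List<String> tags) {
        int[] targets = new int[tokens.size()]; // int arrays default to 0, i.e. NOTAG
        for (String tag : tags) {
            String[] parts = tag.toLowerCase(Locale.ROOT).split("-");
            if (parts.length == 0) {
                continue; // degenerate tag such as "-"
            }
            for (int i = 0; i + parts.length <= tokens.size(); i++) {
                if (matchesAt(tokens, i, parts)) {
                    targets[i] = TAGSTART;
                    for (int j = 1; j < parts.length; j++) {
                        targets[i + j] = TAGMID;
                    }
                }
            }
        }
        return targets;
    }

    /** True if the tag parts equal the tokens starting at index i, ignoring case. */
    private static boolean matchesAt(List<String> tokens, int i, String[] parts) {
        for (int j = 0; j < parts.length; j++) {
            if (!tokens.get(i + j).toLowerCase(Locale.ROOT).equals(parts[j])) {
                return false;
            }
        }
        return true;
    }

    /** Previous-target feature: START for the first word, otherwise the preceding word's target. */
    static int previousTarget(int[] targets, int i) {
        return i == 0 ? START : targets[i - 1];
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("I", "use", "SQL", "Server", "with", "Java");
        int[] targets = label(tokens, Arrays.asList("sql-server", "java"));
        System.out.println(Arrays.toString(targets));
    }
}
```

When two tags overlap on the same words, the later tag in the list overwrites the earlier one in this sketch. A real implementation would need an explicit rule for that case.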

The features are subject to change as I work on this, but they are currently:

previous word
next word
previous target
word
is word capitalized?

The output file is in the format:

source  tags    prevWord    nextWord    prevTarget  word    cap?    target
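As a rough sketch of how rows in this column order might be assembled, consider the helper below. Everything in it is illustrative rather than taken from the repository: the class FeatureRows, the tab delimiter, the "_" placeholder for a missing previous or next word, the 1/0 encoding of the capitalization flag, and the definition of "capitalized" as "first character is uppercase" are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; column order follows the format line above.
final class FeatureRows {
    static final int START = 3;         // previous-target value for the first word
    static final String NO_WORD = "_";  // assumed placeholder at sentence boundaries

    /** Assumes "capitalized" means the first character is uppercase. */
    static boolean isCapitalized(String word) {
        return !word.isEmpty() && Character.isUpperCase(word.charAt(0));
    }

    /** Builds one tab-separated row per word: source, tags, prevWord, nextWord, prevTarget, word, cap?, target. */
    static List<String> rows(String source, String tags, List<String> words, int[] targets) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            String prevWord = i == 0 ? NO_WORD : words.get(i - 1);
            String nextWord = i == words.size() - 1 ? NO_WORD : words.get(i + 1);
            int prevTarget = i == 0 ? START : targets[i - 1];
            out.add(String.join("\t",
                    source,
                    tags,
                    prevWord,
                    nextWord,
                    Integer.toString(prevTarget),
                    word,
                    isCapitalized(word) ? "1" : "0",
                    Integer.toString(targets[i])));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Use", "Java");
        for (String row : rows("title", "java", words, new int[] {0, 1})) {
            System.out.println(row);
        }
    }
}
```

Note that the previous-target column here is filled from the known targets, which is only possible on labelled training data. At prediction time that value would have to come from the model's own earlier output.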

The plan is to train a MaxEnt model using Mahout that will determine whether or not a given word is tag-like.
