To make distractor generation work, you have to start the following servers.
Servers 1) and 2) can be started from Eclipse:
1) POS tagger server running on port 8080
2) Supersense tagger running on port 8081

Server 3) should be started from the command line:
3) Python WordNet server running on port 8030
You need the Bottle framework for that.
Install the latest stable release with "sudo pip install bottle" or "easy_install -U bottle".
Run it using the commands
:~/workspace/QuestionGeneration$ cd pythonscripts/
:~/workspace/QuestionGeneration/pythonscripts$ python
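
Before running distractor generation, it can help to confirm that all three servers are actually listening. The following is a minimal sketch, not part of this package: it assumes Python 3 and servers running on localhost, and the names port_is_open and check_servers are hypothetical helpers. Only the port numbers come from the notes above.

```python
import socket

# Ports taken from the setup notes above; the keys are descriptive labels only.
SERVERS = {
    "POS tagger": 8080,
    "supersense tagger": 8081,
    "Python WordNet server": 8030,
}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def check_servers(host="localhost", servers=SERVERS):
    """Map each server label to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in servers.items()}


if __name__ == "__main__":
    for name, up in check_servers().items():
        print(f"{name}: {'up' if up else 'NOT reachable'}")
```

A successful connection only shows that something is listening on the port; it does not verify that the expected server is the one answering.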


To make Wikipedia topic extraction work,
you need Node.js and related dependencies

and then run the test.js file
using the command: phantomjs test.js

Question Generation via Overgenerating Transformations and Ranking
Michael Heilman and Noah A. Smith

A software package for generating questions about the factual information
present in a given text.


This is research code.  Though we have attempted to 
make it fairly well organized and reasonably robust to messy text input,
it is not perfectly engineered.
If you decide to use it, do not expect support,
either with getting it running or with making modifications.
Use at your own risk :-)

Licensing Information

This software is distributed under the GNU GPL license.  
See licenses/LICENSE.txt for details.
Licenses for libraries used by the system are also included 
in the "licenses" directory.

-Apache Commons Lang (
-Apache Commons Logging (
-JUnit (
-Stanford NLP tools (
-WordNet (
-The sst-light-0.4 release of the SuperSenseTagger, from which we used the SemCor data for training the supersense tagger (
-The Semcor corpus, used for training the supersense tagger (
-The WEKA toolkit, version 3.6.0 (

Running the system

To run the program, execute the script 
(or just run the command it includes).  It takes plain text on standard input
and prints questions on standard output.  You may want to
run the parsing and tagging servers first (see next section).

Version 1.6.0_07 of Java was used in developing the system.
The code is packaged up for use on UNIX systems,
or for use in the Eclipse IDE.

Running the parsing and tagging servers

Before running the program, you may want to 
start the socket servers for the parser and supersense tagger
(if you do this, then the actual question generation program
will require less memory).
If the socket server is running, then the system
will not load up the Stanford Parser, which takes 
a lot of time and memory.

To start the parser socket server, execute

You can change whether the Stanford Parser uses
a lexicalized grammar (englishFactored.ser.gz, the default)
or an unlexicalized grammar (englishPCFG.ser.gz).

To change what port the socket server uses, modify
the script and the file

To change which grammar the system will load if the socket server is not running,
modify the parserGrammarFile property in 

There is also a supersense tagging server for deciding WH words 
from high-level word senses (see below).  To run it, execute the script

Running the program without the script:

The main class for generating questions is edu.cmu.ark.QuestionAsker.
You can create questions from a text given on STDIN by 
calling this class's main method, as in the script.
There are several command-line arguments:

Tell the program to print out debugging information about
the step-by-step process of making questions.

Tell the program to print out questions in tab-delimited verbose format,
with source sentences, scores, etc.

--model PATH
load the ranking model at PATH 
(e.g., models/linear-regression-ranker-reg500.ser.gz)

Tell the program to keep questions with unresolved pronouns 
(e.g., What did he like?), which are excluded from the output by
default.  Questions with answers that are unresolved pronouns
(e.g., Who liked pizza?) are not dropped.

Tell the program to keep questions with unresolved pronouns
but also downweight them so they appear towards the end of the ranked list.

Tell the program to downweight questions whose answers
are noun phrases that appear very frequently in the input text
(5+ times and constituting 5+% of non-stop word nouns).

--properties PATH
Set the properties file that tells the program
where to look for resources 
(config/ by default).

Tell the program to downweight yes-no questions so
that they appear mostly behind WH questions in the ranked list

Tell the program to exclude yes-no questions from 
the output (i.e., to only output WH questions)

Perform full noun phrase clarification, replacing coreferent noun
phrases with their first mentions in the text (as identified using
the ARKref coreference tool).  
By default, the system only resolves (i.e., replaces) pronouns.

Do not replace any pronouns or other noun phrases with antecedent mentions.
By default, pronouns are replaced.

--max-length N
Skip any questions longer than N tokens.
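
As an illustration of calling the main class without the script, here is a hedged Python 3 sketch that assembles the java command and pipes text to it on standard input. It uses only the main class and the three options whose names appear in this README (--model, --properties, --max-length). The classpath argument is an assumption you must supply yourself (the script shipped with the package shows the real one), and build_command and generate_questions are hypothetical helper names, not part of the system.

```python
import subprocess

# Main class named in this README.
MAIN_CLASS = "edu.cmu.ark.QuestionAsker"


def build_command(classpath, model=None, properties=None, max_length=None):
    """Assemble the java command line as a list of arguments.

    classpath is an assumption: take the real value from the run script.
    JVM options (e.g. memory settings) are deliberately left out.
    """
    cmd = ["java", "-cp", classpath, MAIN_CLASS]
    if model is not None:
        cmd += ["--model", model]
    if properties is not None:
        cmd += ["--properties", properties]
    if max_length is not None:
        cmd += ["--max-length", str(max_length)]
    return cmd


def generate_questions(text, classpath, **options):
    """Send plain text on STDIN and return the lines printed on STDOUT."""
    result = subprocess.run(
        build_command(classpath, **options),
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if java exits with an error
    )
    return result.stdout.splitlines()
```

For example, build_command("my-classpath", model="models/linear-regression-ranker-reg500.ser.gz", max_length=30) returns the argument list without launching Java, which is a convenient way to inspect the command before running it.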

Ranking model

A question ranking model trained on human judgments of output from stages 1 and 2 is provided here: 

Unit Testing

A suite of JUnit tests is provided to ensure that the system is working properly. 
See,, and
(under src/edu/cmu/ark).  These unit tests were developed with JUnit 3.8.2 (

Several other packages are incorporated into the system.
See the "licenses" subdirectory for licensing information.

Stanford Parser, Tregex and Tsurgeon

See the distributions of the Stanford Parser and NER system for more information about how they work,
their source code, licenses (also GNU), etc.


See the webpage or distribution of the WEKA machine learning package for more information
about how it works, its license (also GNU), etc.  The system uses WEKA version 3.6.


See the webpage of the Java WordNet Library for more information about 
how it works, its license, etc.  The system uses version 1.4.1 rc2.

ARKref and Supersense Tagging libraries

ARKref (lib/arkref.jar) is a simple noun phrase coreference resolution module.
It is based on the syntactic module described in Haghighi & Klein, EMNLP 2009.
This package includes version 20110321 of ARKref.
For more details, see

The system also uses a supersense tagger (lib/supersense-tagger.jar) to decide appropriate WH words.  The tagger labels word tokens with high-level semantic types (e.g., noun.person, noun.time).  It is a Java reimplementation of the system described in the following paper:

M. Ciaramita and Y. Altun.  2006.  Broad-Coverage Sense Disambiguation and Information Extraction with a Supersense Sequence Tagger.  In Proc. of EMNLP.

For a standalone version of the SST, see

Language model

The language model config/anc-v2-written.lm.gz
was created from the written portion of the American National Corpus
Second Release (
using the SRILM toolkit.

Papers and website

The system is described in the following work:
-M. Heilman and N. A. Smith. 2010. Good Question! Statistical Ranking for Question Generation. In Proc. of NAACL/HLT.
-M. Heilman and N. A. Smith. 2009. Question Generation via Overgenerating Transformations and Ranking. Language Technologies Institute, Carnegie Mellon University Technical Report CMU-LTI-09-013.
-Michael Heilman's Ph.D. dissertation.

Website for the system:


Generating Questions and Distractors automatically from Multimedia. Undergraduate Thesis work.


