Changes to ArrayListOfInts and Hooka code. Also added some test cases for the changed methods. #31

Merged
merged 2 commits

2 participants

Commits on Jan 5, 2012
  1. @ferhanture
Commits on Jan 25, 2012
  1. @ferhanture

    Updated merge method in ArrayListOfInts, added test cases. Also updated Hooka's prob return methods.

    ferhanture authored
51 docs/hooka.html
@@ -22,13 +22,17 @@
p.p15 {margin: 0.0px 0.0px 0.0px 0.0px; text-align: center; font: 10.0px Verdana}
p.p16 {margin: 0.0px 0.0px 0.0px 0.0px; text-align: center; font: 11.0px 'Helvetica Neue'}
p.p18 {margin: 0.0px 0.0px 14.0px 0.0px; font: 14.0px Verdana; color: #555555}
+ p.p19 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Verdana; color: #606060}
+ p.p20 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Verdana; color: #606060; min-height: 15.0px}
span.s1 {text-decoration: underline ; color: #0000ee}
span.s2 {font: 13.0px Courier; color: #000000}
- span.s3 {color: #901568}
- span.s4 {color: #4a23fe}
- span.s5 {color: #8f156f}
- span.s6 {color: #483ef5}
- span.s7 {font: 12.0px Times}
+ span.s3 {font: 11.0px Monaco; color: #000000}
+ span.s4 {color: #606060}
+ span.s5 {color: #901568}
+ span.s6 {color: #4a23fe}
+ span.s7 {color: #8f156f}
+ span.s8 {color: #483ef5}
+ span.s9 {font: 12.0px Times}
span.Apple-tab-span {white-space:pre}
table.t1 {border-collapse: collapse}
td.td1 {width: 92.0px; border-style: solid; border-width: 1.0px 1.0px 1.0px 1.0px; border-color: #cbcbcb #cbcbcb #cbcbcb #cbcbcb; padding: 0.0px 5.0px 0.0px 5.0px}
@@ -61,20 +65,20 @@ <h2 style="margin: 0.0px 0.0px 14.0px 0.0px; font: 16.0px Verdana; color: #55555
<p class="p8"><br></p>
<p class="p4">The first argument is the HDFS path to the input data in XML format, as described above. The second argument is the HDFS path to the working directory, to which data is written. The next two arguments indicate the language code of source and target language. Fifth and sixth arguments show the number of EM iterations using IBM Model1 and HMM, respectively. The last argument indicates whether tokenization should be done.<span class="Apple-converted-space"> </span></p>
<p class="p5">usage: [input-path] [root-path] [src-lang] [trg-lang] [number-of-model1-iters] [number-of-hmm-iters] (local)</p>
-<p class="p9">(Note: Use last argument only for local runs.)</p>
+<p class="p9">(Note: Enter last argument only for local runs.)</p>
<p class="p6"><br></p>
<p class="p4">The program starts by running a Hadoop job that preprocesses the dataset and performs tokenization/truncation. If the input text is already tokenized/stemmed, then you can opt out of doing it here by setting the last argument to false. After preprocessing, the program runs the EM iterations, each consisting of a computation step and merging step, in two separate Hadoop jobs. An alignment step completes the execution of each set of iterations.</p>
-<p class="p4">The output of the program consists of two vocabularies (source-side vocabulary <span class="s2">vocab.E</span>, target-side vocabulary <span class="s2">vocab.F</span>) and a lexical conditional probability table. Each vocabulary is represented by an instance of the <span class="s2">Vocab</span> class, as a mapping from terms in the language to a unique integer identifier. The probability table is represented by an instance of the <span class="s2">TTable</span> class, which contains all possible translations (with respective conditional probabilities) of each term in the target language (i.e., P(f|e) for all e in target vocabulary). In order to generate conditional probabilities in the other direction (i.e., P(e|f)) you should run Hooka with the language arguments swapped:</p>
+<p class="p4">The output of the program consists of two vocabularies (source-side vocabulary <span class="s2">vocab.E</span>, target-side vocabulary <span class="s2">vocab.F</span>) and a lexical conditional probability table. Each vocabulary is represented by an instance of the <span class="s2">Vocab</span> class, as a mapping from terms in the language to a unique integer identifier. The probability table is represented by an instance of the <span class="s3">TTable_monolithic_IFAs </span>class <span class="s4">(implements </span><span class="s3">TTable</span><span class="s4">)</span>, which contains all possible translations (with respective conditional probabilities) of each term in the target language (i.e., P(f|e) for all e in target vocabulary). In order to generate conditional probabilities in the other direction (i.e., P(e|f)) you should run Hooka with the language arguments swapped:</p>
<p class="p5">$ bin/hadoop jar cloud9.jar $DATADIR/europarl-v6.de-en.xml $WORKDIR en de 5 5 true</p>
<p class="p6"><br></p>
<p class="p4">Once the vocabulary and ttable are written to disk, they can be loaded into memory and used for certain operations. For instance, a <span class="s2">Vocab</span> object can be used to retrieve words as follows:</p>
-<p class="p10"><span class="Apple-tab-span"> </span><span class="Apple-tab-span"> </span>Vocab engVocab = HadoopAlign.loadVocab(<span class="s3">new</span> Path(vocabHDFSPath), hdfsConf);</p>
-<p class="p10"><span class="Apple-tab-span"> </span><span class="Apple-tab-span"> </span><span class="s3">int</span> eId = engVocab.get(<span class="s4">"book"</span>);</p>
+<p class="p10"><span class="Apple-tab-span"> </span><span class="Apple-tab-span"> </span>Vocab engVocab = HadoopAlign.loadVocab(<span class="s5">new</span> Path(vocabHDFSPath), hdfsConf);</p>
+<p class="p10"><span class="Apple-tab-span"> </span><span class="Apple-tab-span"> </span><span class="s5">int</span> eId = engVocab.get(<span class="s6">"book"</span>);</p>
<p class="p10"><span class="Apple-tab-span"> </span><span class="Apple-tab-span"> </span>String eString = engVocab.get(eId);<span class="Apple-converted-space"> </span></p>
<p class="p8"><br></p>
<p class="p4">A <span class="s2">TTable</span> object can be used to find conditional probabilities as follows:</p>
-<p class="p10"><span class="Apple-tab-span"> </span><span class="Apple-tab-span"> </span>TTable_monolithic_IFAs ttable_en2de = <span class="s3">new</span> TTable_monolithic_IFAs(FileSystem.get(hdfsConf), <span class="s3">new</span> Path(<span class="s4">ttableHDFSPath</span>), <span class="s3">true</span>);</p>
-<p class="p10"><span class="s4"><span class="Apple-tab-span"> </span><span class="Apple-tab-span"> </span></span><span class="s5">float</span><span class="s4"> </span>prob<span class="s4"> = </span>ttable_en2de.get(eId,fId);</p>
+<p class="p10"><span class="Apple-tab-span"> </span><span class="Apple-tab-span"> </span>TTable_monolithic_IFAs ttable_en2de = <span class="s5">new</span> TTable_monolithic_IFAs(FileSystem.get(hdfsConf), <span class="s5">new</span> Path(<span class="s6">ttableHDFSPath</span>), <span class="s5">true</span>);</p>
+<p class="p10"><span class="s6"><span class="Apple-tab-span"> </span><span class="Apple-tab-span"> </span></span><span class="s7">float</span><span class="s6"> </span>prob<span class="s6"> = </span>ttable_en2de.get(eId,fId);</p>
<p class="p11"><br></p>
<p class="p10"><span class="Apple-tab-span"> </span><span class="Apple-tab-span"> </span>// find all German translations of "book"</p>
<p class="p10"><span class="Apple-tab-span"> </span><span class="Apple-tab-span"> </span>int[] fIdArray = ttable_en2de.get(eId).getTranslations(0.1f);<span class="Apple-tab-span"> </span></p>
@@ -82,8 +86,8 @@ <h2 style="margin: 0.0px 0.0px 14.0px 0.0px; font: 16.0px Verdana; color: #55555
<p class="p3"><b>Simplifying the Translation Table</b></p>
<p class="p12">Alignment tools use statistical smoothing techniques to distribute the probability mass more conservatively. This results in hundreds or even thousands of translations per term in the vocabulary. However, for many applications, one may only need the most probable few translations. This may reduce redundancy in the <span class="s2">TTable</span> object, as well as decrease noise in the distributions used to estimate possible translations of source words. We empirically decide on the following simlifying heuristic: Only include the most probable 15 translations of each source term, unless the total sum of probabilities exceed 0.9 with less than 15. Ivory provides the necessary code to simplify a given Hooka <span class="s2">TTable</span> object via the <span class="s2">ivory.util.CLIRUtils</span> class:</p>
<p class="p13"><br></p>
-<p class="p10">CLIRUtils.createTTableFromHooka(HookaWorkDir+<span class="s6">"/de-en/vocab.F", </span>HookaWorkDir+<span class="s6">"/de-en/vocab.E, </span>HookaWorkDir+<span class="s6">"/de-en/tmp.ttable", </span>HookaWorkDir+<span class="s6">"/de-en-simple/vocab.F", </span>HookaWorkDir+<span class="s6">"/de-en-simple/vocab.E, </span>HookaWorkDir+<span class="s6">"/de-en-simple/tmp.ttable", </span>FileSystem.get(hdfsConf));</p>
-<p class="p10">CLIRUtils.createTTableFromHooka(HookaWorkDir+<span class="s6">"/en-de/vocab.E, </span>HookaWorkDir+<span class="s6">"/en-de/vocab.F, </span>HookaWorkDir+<span class="s6">"/en-de/tmp.ttable"</span>, HookaWorkDir+<span class="s6">"/en-de-simple/vocab.E, </span>HookaWorkDir+<span class="s6">"/en-de-simple/vocab.F, </span>HookaWorkDir+<span class="s6">"/en-de-simple/tmp.ttable"</span>, FileSystem.get(hdfsConf));</p>
+<p class="p10">CLIRUtils.createTTableFromHooka(HookaWorkDir+<span class="s8">"/de-en/vocab.F", </span>HookaWorkDir+<span class="s8">"/de-en/vocab.E, </span>HookaWorkDir+<span class="s8">"/de-en/tmp.ttable", </span>HookaWorkDir+<span class="s8">"/de-en-simple/vocab.F", </span>HookaWorkDir+<span class="s8">"/de-en-simple/vocab.E, </span>HookaWorkDir+<span class="s8">"/de-en-simple/tmp.ttable", </span>FileSystem.get(hdfsConf));</p>
+<p class="p10">CLIRUtils.createTTableFromHooka(HookaWorkDir+<span class="s8">"/en-de/vocab.E, </span>HookaWorkDir+<span class="s8">"/en-de/vocab.F, </span>HookaWorkDir+<span class="s8">"/en-de/tmp.ttable"</span>, HookaWorkDir+<span class="s8">"/en-de-simple/vocab.E, </span>HookaWorkDir+<span class="s8">"/en-de-simple/vocab.F, </span>HookaWorkDir+<span class="s8">"/en-de-simple/tmp.ttable"</span>, FileSystem.get(hdfsConf));</p>
<p class="p6"><br></p>
<h2 style="margin: 0.0px 0.0px 14.0px 0.0px; font: 16.0px Verdana; color: #555555"><b>Evaluation</b></h2>
<h2 style="margin: 0.0px 0.0px 14.0px 0.0px; font: 12.0px Verdana; color: #555555">We evaluated Hooka by comparing it to two popular word alignment tools: GIZA++ and berkeleyAligner. We designed an intrinsic evaluation to test the quality of the conditional probability values output by each system. We experimented with the German and English portions of the Europarl corpus, which contains proceedings from the European Parliament. We constructed artificial documents by concatenating every 10 consecutive sentences into a single document. In this manner, we sampled 505 document pairs that are mutual translations of each other (and therefore semantically similar by construction). This provides ground truth to evaluate the effectiveness of the three systems on the task of pairwise similarity.<span class="Apple-converted-space"> </span></h2>
@@ -142,7 +146,7 @@ <h2 style="margin: 0.0px 0.0px 14.0px 0.0px; font: 12.0px Verdana; color: #55555
<p class="p15">0.449</p>
</td>
<td valign="middle" class="td3">
- <p class="p15">0.114<span class="s7"><span class="Apple-converted-space"> </span></span></p>
+ <p class="p15">0.114<span class="s9"><span class="Apple-converted-space"> </span></span></p>
</td>
</tr>
</tbody>
@@ -185,15 +189,22 @@ <h2 style="margin: 0.0px 0.0px 14.0px 0.0px; font: 12.0px Times; min-height: 14.
<p class="p6"><br></p>
<p class="p5">$ sort -k2 $GIZAWorkDir/lex.0-0.n2f -o $GIZAWorkDir/lex.0-0.n2f.sorted</p>
<p class="p6"><br></p>
-<p class="p12">Next, use Hooka to convert each file into a TTable object and a pair of Vocab objects:</p>
+<p class="p12">Next, use Hooka to convert each file into a <span class="s2">TTable</span> object and a pair of <span class="s2">Vocab</span> objects:</p>
<p class="p11"><br></p>
-<p class="p10">CLIRUtils.createTTableFromGIZA(GIZAWorkDir+<span class="s4">"/lex.0-0.n2f.sorted"</span>, e2f_eVocabFile, e2f_fVocabFile, e2f_ttableFile, FileSystem.get(hdfsConf));</p>
-<p class="p10">CLIRUtils.createTTableFromGIZA(GIZAWorkDir+<span class="s4">"/lex.0-0.f2n"</span>, f2e_fVocabFile, f2e_eVocabFile, f2e_ttableFile, FileSystem.get(hdfsConf));</p>
+<p class="p10">CLIRUtils.createTTableFromGIZA(GIZAWorkDir+<span class="s6">"/lex.0-0.n2f.sorted"</span>, e2f_eVocabFile, e2f_fVocabFile, e2f_ttableFile, FileSystem.get(hdfsConf));</p>
+<p class="p10">CLIRUtils.createTTableFromGIZA(GIZAWorkDir+<span class="s6">"/lex.0-0.f2n"</span>, f2e_fVocabFile, f2e_eVocabFile, f2e_ttableFile, FileSystem.get(hdfsConf));</p>
<p class="p6"><br></p>
<p class="p18"><b>2. berkeleyAligner</b></p>
-<p class="p12">The output of berkeleyAligner is similar to GIZA++, but you don't need to do any preprocessing before converting with Hooka. Here is an example command line:</p>
+<p class="p12">The output of berkeleyAligner is similar to GIZA++, but you don't need to do any preprocessing before converting with Hooka. Here is an example method call:</p>
<p class="p6"><br></p>
-<p class="p10">CLIRUtils.createTTableFromBerkeleyAligner(berkeleyWorkDir+<span class="s4">"/stage2.1.params.txt"</span>, e2f_eVocabFile, e2f_fVocabFile, e2f_ttableFile, FileSystem.get(hdfsConf));</p>
-<p class="p10">CLIRUtils.createTTableFromBerkeleyAligner(berkeleyWorkDir+<span class="s4">"/stage2.2.params.txt"</span>, f2e_fVocabFile, f2e_eVocabFile, f2e_ttableFile, FileSystem.get(hdfsConf));</p>
+<p class="p10">CLIRUtils.createTTableFromBerkeleyAligner(berkeleyWorkDir+<span class="s6">"/stage2.1.params.txt"</span>, e2f_eVocabFile, e2f_fVocabFile, e2f_ttableFile, FileSystem.get(hdfsConf));</p>
+<p class="p10">CLIRUtils.createTTableFromBerkeleyAligner(berkeleyWorkDir+<span class="s6">"/stage2.2.params.txt"</span>, f2e_fVocabFile, f2e_eVocabFile, f2e_ttableFile, FileSystem.get(hdfsConf));</p>
+<p class="p11"><br></p>
+<p class="p19">For convenience, CLIRUtils has a main method that will run from the command line:</p>
+<p class="p20"><br></p>
+<p class="p10">usage: [input-lexicalprob-file_f2e] [input-lexicalprob-file_e2f] [type=giza|berkeley] [src-vocab_f] [trg-vocab_e] [prob-table_f--&gt;e] [src-vocab_e] [trg-vocab_f] [prob-table_e--&gt;f]</p>
+<p class="p11"><br></p>
+<p class="p19">First two arguments are the output files of GIZA++ or berkeleyAligner, and they need to be on the local file system. The last six arguments are paths for the output files, written to disk as <span class="s2">TTable</span> and <span class="s2">Vocab</span> objects.</p>
+<p class="p11"><br></p>
</body>
</html>
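
Tying the documented pieces together: a minimal sketch (not from the PR; vocabEPath, vocabFPath, ttablePath, and the Hadoop Configuration conf are hypothetical names) that loads the vocabularies and ttable, then prints each translation of "book" with probability above 0.1, using the API shown above:

    // Sketch only -- path variables and conf are assumed to be set up elsewhere.
    Vocab vocabE = HadoopAlign.loadVocab(new Path(vocabEPath), conf);
    Vocab vocabF = HadoopAlign.loadVocab(new Path(vocabFPath), conf);
    TTable_monolithic_IFAs ttable =
        new TTable_monolithic_IFAs(FileSystem.get(conf), new Path(ttablePath), true);

    int eId = vocabE.get("book");
    // all translations f with P(f|e) > 0.1, per the getTranslations example above
    for (int fId : ttable.get(eId).getTranslations(0.1f)) {
      System.out.println(vocabF.get(fId) + "\t" + ttable.get(eId, fId));
    }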
44 src/dist/edu/umd/cloud9/util/array/ArrayListOfInts.java
@@ -362,7 +362,7 @@ public ArrayListOfInts intersection(ArrayListOfInts other) {
}
/**
- * Merges two sorted (ascending order) lists into one sorted union.
+ * Merges two sorted (ascending order) lists into one sorted union. Duplicate items remain in the merged list.
*
* @param sortedLst list to be merged into this
* @return merged sorted (ascending order) union of this and sortedLst
@@ -390,6 +390,48 @@ public ArrayListOfInts merge(ArrayListOfInts sortedLst) {
return result;
}
+
+ /**
+ * Merges two sorted (ascending order) lists into one sorted union. Duplicate items are discarded in the merged list.
+ *
+ * @param sortedLst list to be merged into this
+ * @return merged sorted (ascending order) union of this and sortedLst
+ */
+ public ArrayListOfInts mergeNoDuplicates(ArrayListOfInts sortedLst) {
+ ArrayListOfInts result = new ArrayListOfInts();
+ int indA = 0, indB = 0;
+ while (indA < this.size() || indB < sortedLst.size()) {
+ // if we've iterated to the end, then add from the other
+ if (indA == this.size()) {
+ if (!result.contains(sortedLst.get(indB))) {
+ result.add(sortedLst.get(indB));
+ }
+ indB++;
+ continue;
+ } else if (indB == sortedLst.size()) {
+ if (!result.contains(this.get(indA))) {
+ result.add(this.get(indA));
+ }
+ indA++;
+ continue;
+ } else {
+ // append the lesser value
+ if (this.get(indA) < sortedLst.get(indB)) {
+ if (!result.contains(this.get(indA))) {
+ result.add(this.get(indA));
+ }
+ indA++;
+ } else {
+ if (!result.contains(sortedLst.get(indB))) {
+ result.add(sortedLst.get(indB));
+ }
+ indB++;
+ }
+ }
+ }
+
+ return result;
+ }
/**
* Extracts a sub-list.
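
For a quick sense of how merge and mergeNoDuplicates differ, a sketch mirroring the new testMerge4 case below:

    ArrayListOfInts a = new ArrayListOfInts(new int[] {3, 7, 10});
    ArrayListOfInts b = new ArrayListOfInts(new int[] {7, 8});

    a.merge(b);             // [3, 7, 7, 8, 10] -- the shared 7 appears twice
    a.mergeNoDuplicates(b); // [3, 7, 8, 10]    -- the shared 7 appears once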
34 src/dist/edu/umd/hooka/alignment/IndexedFloatArray.java
@@ -259,15 +259,15 @@ public final float getLazy(int n) {
public int[] getTranslations(float probThreshold){
ArrayListOfInts words = new ArrayListOfInts();
- if(_useBinSearch){
- for(int i=0;i<_data.length;i++){
- if(_data[i]>probThreshold){
+ if (_useBinSearch) {
+ for (int i=0; i < _data.length; i++) {
+ if (_data[i] > probThreshold) {
words.add(_indices[i]);
}
}
}else{
- for(int i=0;i<_data.length;i++){
- if(_data[i]>probThreshold){
+ for (int i=0; i < _data.length; i++) {
+ if (_data[i] > probThreshold) {
words.add(i);
}
}
@@ -276,17 +276,17 @@ public final float getLazy(int n) {
return words.getArray();
}
- public PriorityQueue<PairOfFloatInt> getTranslationsWithProbs(){
+ public PriorityQueue<PairOfFloatInt> getTranslationsWithProbs(float probThreshold){
PriorityQueue<PairOfFloatInt> q = new PriorityQueue<PairOfFloatInt>();
- if(_useBinSearch){
- for(int i=0;i<_data.length;i++){
- if(_data[i]>0.01){
+ if (_useBinSearch) {
+ for (int i=0; i < _data.length; i++) {
+ if (_data[i] > probThreshold) {
q.add(new PairOfFloatInt(_data[i],_indices[i]));
}
}
}else{
- for(int i=0;i<_data.length;i++){
- if(_data[i]>0.01){
+ for (int i=0; i < _data.length; i++) {
+ if (_data[i] > probThreshold) {
q.add(new PairOfFloatInt(_data[i],i));
}
}
@@ -294,17 +294,17 @@ public final float getLazy(int n) {
return q;
}
- public List<PairOfFloatInt> getTranslationsWithProbsAsList(){
+ public List<PairOfFloatInt> getTranslationsWithProbsAsList(float probThreshold){
List<PairOfFloatInt> l = new ArrayList<PairOfFloatInt>();
- if(_useBinSearch){
- for(int i=0;i<_data.length;i++){
- if(_data[i]>0.01){
+ if (_useBinSearch) {
+    for (int i=0; i < _data.length; i++) {
+ if (_data[i] > probThreshold) {
l.add(new PairOfFloatInt(_data[i],_indices[i]));
}
}
}else{
- for(int i=0;i<_data.length;i++){
- if(_data[i]>0.01){
+ for (int i=0; i < _data.length; i++) {
+ if (_data[i] > probThreshold) {
l.add(new PairOfFloatInt(_data[i],i));
}
}
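
Note that the threshold change is breaking for existing callers: the old zero-argument methods hardcoded a 0.01 cutoff, while the new signatures take it as a parameter. A migration sketch, with arr standing in for a populated IndexedFloatArray:

    // before (cutoff fixed at 0.01 inside the method):
    //   PriorityQueue<PairOfFloatInt> q = arr.getTranslationsWithProbs();
    // after (caller supplies the cutoff; 0.01f preserves the old behavior):
    PriorityQueue<PairOfFloatInt> q = arr.getTranslationsWithProbs(0.01f);
    List<PairOfFloatInt> l = arr.getTranslationsWithProbsAsList(0.01f);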
55 src/test/edu/umd/cloud9/util/array/ArrayListOfIntsTest.java
@@ -435,6 +435,61 @@ public void testMerge3() {
}
@Test
+ public void testMerge4() {
+    // CASE: the two lists share a common element
+
+ ArrayListOfInts a = new ArrayListOfInts();
+ a.add(3);
+ a.add(7);
+ a.add(10);
+
+ ArrayListOfInts b = new ArrayListOfInts();
+ b.add(7);
+ b.add(8);
+
+ ArrayListOfInts c = a.merge(b);
+ assertEquals(c.size(), 5);
+ assertEquals(c.get(0), 3);
+ assertEquals(c.get(1), 7);
+ assertEquals(c.get(2), 7);
+ assertEquals(c.get(3), 8);
+ assertEquals(c.get(4), 10);
+
+ ArrayListOfInts cNoDups = a.mergeNoDuplicates(b);
+ assertEquals(cNoDups.size(), 4);
+ assertEquals(cNoDups.get(0), 3);
+ assertEquals(cNoDups.get(1), 7);
+ assertEquals(cNoDups.get(2), 8);
+ assertEquals(cNoDups.get(3), 10);
+ }
+
+ @Test
+ public void testMerge5() {
+    // CASE: the second list is a single element that also appears in the first
+
+ ArrayListOfInts a = new ArrayListOfInts();
+ a.add(3);
+ a.add(7);
+ a.add(10);
+
+ ArrayListOfInts b = new ArrayListOfInts();
+ b.add(7);
+
+ ArrayListOfInts c = a.merge(b);
+ assertEquals(c.size(), 4);
+ assertEquals(c.get(0), 3);
+ assertEquals(c.get(1), 7);
+ assertEquals(c.get(2), 7);
+ assertEquals(c.get(3), 10);
+
+ ArrayListOfInts cNoDups = a.mergeNoDuplicates(b);
+ assertEquals(cNoDups.size(), 3);
+ assertEquals(cNoDups.get(0), 3);
+ assertEquals(cNoDups.get(1), 7);
+ assertEquals(cNoDups.get(2), 10);
+ }
+
+ @Test
public void testSubList() {
ArrayListOfInts a = new ArrayListOfInts(new int[] {1, 2, 3, 4, 5, 6, 7});
ArrayListOfInts b = a.subList(1, 5);
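
One edge case the new tests leave uncovered is merging with an empty list; a hypothetical test in the same style (not part of this PR) might look like:

    @Test
    public void testMergeEmpty() {
      // hypothetical: merging with an empty list should yield a copy of the non-empty list
      ArrayListOfInts a = new ArrayListOfInts(new int[] {3, 7});
      ArrayListOfInts empty = new ArrayListOfInts();

      ArrayListOfInts c = a.merge(empty);
      assertEquals(c.size(), 2);
      assertEquals(c.get(0), 3);
      assertEquals(c.get(1), 7);

      ArrayListOfInts cNoDups = a.mergeNoDuplicates(empty);
      assertEquals(cNoDups.size(), 2);
      assertEquals(cNoDups.get(0), 3);
      assertEquals(cNoDups.get(1), 7);
    }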