
Last porting of articles.

1 parent 46cb969 commit bc0e617fbf0ee5d273c1171967d99177fa791163 @jvangael committed Apr 4, 2013
@@ -0,0 +1,42 @@
+---
+layout: post
+title: "ACL 09 & EMNLP 09"
+description: ""
+category:
+tags: [Conference Reports,Language]
+---
+{% include JB/setup %}
+
+ACL-IJCNLP 2009 and EMNLP 2009 have just finished here in Singapore. As an outsider to the field I had a hard time following many talks but nonetheless enjoyed the conference. The highlight for me was the talk by [Richard Sproat](http://rws.xoba.com/index.html), who asked whether there exists a statistical test to check if a series of symbol sequences is actually a language. If such a test existed, we could use it to decide whether the set of symbols known as the [Indus Valley Script](http://en.wikipedia.org/wiki/Indus_script) is actually a language. Very fascinating stuff: I immediately bought “Lost Languages” by Andrew Robinson to learn more about the history of deciphering dead languages.
+
+The conference had some very cool papers; the first one I really liked was [Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling](http://chasen.org/~daiti-m/paper/acl2009segment.pdf) by Daichi Mochihashi et al. They build on the work of Yee Whye Teh and Sharon Goldwater, who showed that Kneser-Ney language modelling is really an approximate version of a hierarchical Pitman-Yor based language model (HPYLM). The HPYLM starts from a unigram model over a fixed dictionary and hence doesn’t accommodate out-of-vocabulary words. Mochihashi et al. extended the HPYLM so that the base distribution is a character infinity-gram that is itself an HPYLM (over characters). They call this model the nested HPYLM or NPYLM. There is no need for a vocabulary of words in the NPYLM; rather, the base distribution is a distribution over arbitrarily long strings. In addition, the model performs automatic word segmentation. The results are really promising: from their paper, consider the following unsegmented English text
+
+ lastly,shepicturedtoherselfhowthissamelittlesisterofhersw
+ ould,intheafter-time,beherselfagrownwoman;andhowshe
+ wouldkeep,throughallherriperyears,thesimpleandlovingh
+ eartofherchildhood:andhowshewouldgatheraboutherothe
+ rlittlechildren,andmaketheireyesbrightandeagerwithmany
+ astrangetale,perhapsevenwiththedreamofwonderlandoo
+ ngago:andhowshewouldfeelwithalltheirsimplesorrows,an
+ dndapleasureinalltheirsimplejoys,rememberingherownc
+ hild-life,andthehappysummerdays. [...]
+
+When the NPYLM is trained on this data, it finds the following segmentation
+
+ last ly , she pictured to herself how this same little sister
+ of her s would , inthe after - time , be herself agrown woman ;
+ and how she would keep , through allher ripery ears , the simple
+ and loving heart of her child hood : and how she would gather
+ about her other little children ,and make theireyes bright and
+ eager with many a strange tale , perhaps even with the dream of
+ wonderland of longago : and how she would feel with all their
+ simple sorrow s , and a pleasure in all their simple joys ,
+ remember ing her own child - life , and thehappy summerday s .
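+
+For readers who want to play with the building block, here is a minimal sketch (not the authors’ code) of the Chinese-restaurant construction of a two-parameter Pitman-Yor process, which the HPYLM and NPYLM stack hierarchically; the discount and strength values are arbitrary illustrations:
+
+```python
+import random
+
+def pitman_yor_crp(n_customers, discount=0.8, strength=1.0, seed=0):
+    """Seat customers one by one in a two-parameter Pitman-Yor Chinese restaurant process.
+    Returns the table sizes; in an HPYLM each new table would additionally draw a word
+    (or character) from the base distribution."""
+    random.seed(seed)
+    tables = []  # tables[k] = number of customers seated at table k
+    for n in range(n_customers):  # n customers are already seated
+        r = random.uniform(0, n + strength)
+        for k, c in enumerate(tables):
+            r -= c - discount  # join table k with probability (c_k - discount) / (n + strength)
+            if r < 0:
+                tables[k] += 1
+                break
+        else:
+            tables.append(1)   # new table with probability (strength + discount * #tables) / (n + strength)
+    return tables
+
+print(sorted(pitman_yor_crp(1000), reverse=True))  # heavy-tailed, Zipf-like table sizes
+```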
+
+Another paper I liked was [A note on the implementation of Hierarchical Dirichlet Processes](http://homepages.inf.ed.ac.uk/pblunsom/pubs/blunsom-acl09-short.pdf) by Phil Blunsom et al. The authors discuss how previous approximate inference schemes for the HDP collapsed Gibbs sampler can turn out to be quite bad, and they propose a more efficient and exact collapsed Gibbs sampler for the HDP.
+
+A few other papers I really enjoyed:
+* [Minimized Models for Unsupervised Part-of-Speech Tagging](http://www.isi.edu/natural-language/mt/acl09-tag.pdf) by Sujith Ravi et al.
+* [Polylingual Topic Models](http://people.cs.umass.edu/~wallach/publications/mimno09polylingual.pdf) by David Mimno et al.
+* [Graphical Models over Multiple Strings](http://www.aclweb.org/anthology-new/D/D09/D09-1011.pdf) by Markus Dreyer and Jason Eisner
+* [Bayesian Learning of a Tree Substitution Grammar](http://www.aclweb.org/anthology/P/P09/P09-2012.pdf) by Matt Post and Daniel Gildea
@@ -0,0 +1,34 @@
+---
+layout: post
+title: "Dirichlet Distributions and Entropy"
+description: ""
+category:
+tags: [iHMM,Machine Learning,Statistics]
+---
+{% include JB/setup %}
+
+In between all the [Netflix excitement](http://news.bbc.co.uk/1/hi/technology/8268287.stm) I managed to read a paper that was mentioned by David Blei during his machine learning summer school talk: [Entropy and Inference, Revisited](http://www.princeton.edu/~wbialek/our_papers/nemenman+al_02.pdf) by Ilya Nemenman, Fariel Shafee and William Bialek from NIPS 2002. This paper discusses some issues that arise when learning Dirichlet distributions. Since Dirichlet distributions are so common, I think these results should be more widely known.
+
+_Given that I’ve been reading a lot of natural language processing papers where Dirichlet distributions are quite common, I’m surprised I haven’t run into this work before._
+
+First, a little bit of introduction on Dirichlet distributions. The Dirichlet distribution is a “distribution over distributions”; in other words, a draw from a Dirichlet distribution is a vector of positive real numbers that sum to one. The Dirichlet distribution is parameterized by a vector of positive real numbers which captures its mean and variance. It is often very natural to work with a slightly constrained version called the symmetric Dirichlet distribution: in this case all entries of the parameter vector are the same number. This implies that the mean of the Dirichlet is the uniform distribution and the variance is captured by the magnitude of the parameter vector. Let us denote by beta the parameter of the symmetric Dirichlet. When beta is small, samples from the Dirichlet will have high variance, while when beta is large, samples from the Dirichlet will have small variance. The plot below illustrates this idea for a Dirichlet with 1000 dimensions: in the top plot beta is very small and hence a draw from this Dirichlet has only a few non-negligible entries (high variance), while for the Dirichlet with beta = 1 all entries of the sample have roughly the same magnitude (about 0.001).
+
+![Image 1]({{ BASE_PATH }}/assets/images/2009-10-03-dirichlet-distributions-and-entropy_1.png)
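+
+To reproduce this effect numerically rather than visually, here is a minimal numpy sketch (dimension 1000 as in the figure; the two beta values are just illustrative):
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+D = 1000
+for beta in (0.001, 1.0):
+    p = rng.dirichlet(np.full(D, beta))
+    above_uniform = int(np.sum(p > 1.0 / D))  # entries larger than the uniform value 1/D
+    print(f"beta={beta}: largest entry={p.max():.4f}, entries above 1/D: {above_uniform} of {D}")
+```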
+
+Another way to approach the effect of beta is to look at the entropy of a sample from the Dirichlet distribution, denoted by S in the images below. The entropy of a Dirichlet draw is high when beta is large: as beta approaches infinity the Dirichlet concentrates on the completely uniform discrete distribution, and the entropy of a draw approaches its upper bound ln(D), where D is the dimensionality of the Dirichlet. When beta approaches 0, a draw from the Dirichlet approaches a delta peak on a random entry, which is a distribution with entropy 0. The key problem the authors want to address is that when learning an unknown distribution with a Dirichlet prior, beta pretty much fixes the allowed shapes, while a priori we might not have a good reason to believe that what we want to learn will look like either one of these extremes.
+
+The way the authors give insight into this problem is by computing the entropy of a random draw from a Dirichlet distribution. In equations, if we denote the entropy by S, it is a random variable with distribution
+
+![Image 1]({{ BASE_PATH }}/assets/images/2009-10-03-dirichlet-distributions-and-entropy_2.png)
+
+Computing the full distribution is hard, but the authors give a method to compute its mean and variance. The following picture shows the mean and variance of the entropy for draws from a Dirichlet distribution. A bit of notation: K is the dimensionality of the Dirichlet distribution, xi is the mean entropy (as a function of beta) and sigma is the variance of the entropy as a function of beta.
+
+![Image 1]({{ BASE_PATH }}/assets/images/2009-10-03-dirichlet-distributions-and-entropy_3.png)
+
+As you can see from this plot, the distribution of the entropy of Dirichlet draws is extremely peaked for even moderately large K. The authors give a detailed analysis of what this implies but the main take-away message is this: _as you change beta, the entropy of the implied Dirichlet draws varies smoothly; however, because the distribution of the entropy is so peaked, the a priori choice of beta almost completely fixes the entropy_.
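+
+If you want to see this numerically, the mean entropy of a draw from a K-dimensional symmetric Dirichlet has a standard closed form, psi(K*beta + 1) - psi(beta + 1), which as far as I can tell is the xi(beta) curve above; here is a small sketch comparing it against Monte Carlo draws (the sample size and beta grid are arbitrary):
+
+```python
+import numpy as np
+from scipy.special import digamma
+
+def mean_entropy(beta, K):
+    """Closed-form mean entropy (in nats) of a draw from a K-dimensional symmetric Dirichlet(beta)."""
+    return digamma(K * beta + 1) - digamma(beta + 1)
+
+K = 1000
+rng = np.random.default_rng(0)
+for beta in (0.001, 0.01, 0.1, 1.0, 10.0):
+    draws = rng.dirichlet(np.full(K, beta), size=200)
+    entropies = -np.sum(draws * np.log(np.clip(draws, 1e-300, None)), axis=1)
+    print(f"beta={beta:6.3f}  closed form={mean_entropy(beta, K):.3f}  "
+          f"sample mean={entropies.mean():.3f}  sample std={entropies.std():.3f}  ln(K)={np.log(K):.3f}")
+```
+
+Note how small the sampled standard deviation is compared to the full range [0, ln K]: the choice of beta essentially pins down the entropy.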
+
+This is problematic as it means that unless our distribution is sampled almost completely, the estimate of the entropy is dominated by the choice of the prior beta. So how can we fix this? The authors suggest a scheme which I don’t completely understand, but it boils down to a mixture of Dirichlet distributions, obtained by specifying a prior distribution on beta as well.
+
+This mixture idea ties in with something we did in our [EMNLP 09 paper](http://mlg.eng.cam.ac.uk/pub/pdf/VanVlaGha09.pdf): when we were training our part-of-speech tagger we had to choose a prior for the distribution which specifies which words are generated by a particular part-of-speech tag. We know that we have part-of-speech tag classes that generate very few words (e.g. determiners, attributes, …) and a few classes that generate a lot of words (e.g. nouns, adjectives, …). At first, we chose a simple Dirichlet distribution (with fixed beta) as our prior and although the results were reasonable, we ran into the effect explained above: if we set beta to be large, we got very few states in the iHMM, where each state outputs a lot of words. This is good for capturing nouns and verbs but not for other classes. Conversely, when we chose a small beta we got a lot of states in the iHMM, each generating only a few words. Our next idea was to put a Gamma prior on beta; this helped a bit, but it still assumed that there is only one value of beta (or one type of entropy distribution) to learn. This is again unreasonable in the context of natural language. Finally, we chose to put a Dirichlet process prior on beta (with a Gamma base measure). This essentially allows different states in the iHMM to have different betas (but we only expect to see a few distinct betas).
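+
+This is not our actual iHMM code, but a toy sketch of what such a prior does: each state gets a beta through a Chinese restaurant process whose base measure is a Gamma, so even with many states only a handful of distinct beta values appear (the concentration and Gamma parameters below are made up):
+
+```python
+import numpy as np
+
+def sample_state_betas(num_states, concentration=1.0, gamma_shape=2.0, gamma_scale=1.0, seed=0):
+    """Sample one beta per state from a Dirichlet process with a Gamma base measure,
+    using its Chinese restaurant process representation."""
+    rng = np.random.default_rng(seed)
+    betas, counts, assigned = [], [], []
+    for _ in range(num_states):
+        probs = np.array(counts + [concentration], dtype=float)
+        probs /= probs.sum()
+        k = rng.choice(len(probs), p=probs)
+        if k == len(betas):      # new table: draw a fresh beta from the Gamma base measure
+            betas.append(rng.gamma(gamma_shape, gamma_scale))
+            counts.append(1)
+        else:                    # reuse an existing beta
+            counts[k] += 1
+        assigned.append(betas[k])
+    return assigned
+
+print(np.round(sample_state_betas(20), 3))  # 20 states, only a few distinct beta values
+```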
+
+[Entropy and Inference, Revisited](http://www.princeton.edu/~wbialek/our_papers/nemenman+al_02.pdf) is one of those papers with a lot of intuitive explanations; hopefully it helps you make the right choice for priors in your next paper as well.
@@ -0,0 +1,37 @@
+---
+layout: post
+title: "ECML–PKDD 2010 Highlights"
+description: ""
+category:
+tags: [Conference Reports]
+---
+{% include JB/setup %}
+
+![Image 1]({{ BASE_PATH }}/assets/images/2010-10-12-ecmlpkdd-2010-highlights_1.png)
+
+The city of Barcelona just hosted ECML/PKDD this year and I had the opportunity to go down and check out the latest and greatest mostly from the European machine learning community.
+
+The conference for me started with a good tutorial by Francis Bach and Guillaume Obozinsky. They gave an overview of sparsity, in particular various methods that use l1 regularization to induce it.
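+
+Not from the tutorial itself, but a minimal sketch of the mechanism that makes l1 regularization produce sparsity: the proximal operator of the l1 norm (soft-thresholding) sets small coefficients exactly to zero:
+
+```python
+import numpy as np
+
+def soft_threshold(w, lam):
+    """Proximal operator of lam * ||w||_1: shrink every coordinate towards zero
+    and set coordinates smaller than lam exactly to zero."""
+    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)
+
+w = np.array([2.5, -0.3, 0.05, -1.2, 0.0])
+print(soft_threshold(w, lam=0.5))  # the small coefficients become exactly zero
+```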
+
+The first invited speaker was Hod Lipson from Cornell (and, as far as I know, one of the few people in our field who has given a TED talk). The main portion of Hod’s talk was about his work on symbolic regression. The idea is as follows: consider the dataset below
+
+
+![Image 1]({{ BASE_PATH }}/assets/images/2010-10-12-ecmlpkdd-2010-highlights_2.png)
+
+We can apply our favourite regression method, say a spline, to these points and perform accurate interpolation, perhaps even some extrapolation if we choose the right model. The regression function, however, would not give us much insight into why the data looks like this. In symbolic regression, the idea is that we try to come up with a symbolic formula which interpolates the data. In the picture above, the formula that generated the data was EXP(-x)*SIN(PI()*x)+RANDBETWEEN(-0.001,0.001) (in Excel). Hod and his graduate students have built a very cool (and free!) app called Eureqa which uses genetic programming to find a good symbolic expression for a specific dataset. Hod showed us how his software can recover the Hamiltonian and Lagrangian from measurements of a double pendulum. Absolutely amazing!
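+
+Here is a hypothetical reconstruction of such a dataset together with a conventional fit, just to make the point concrete (the x-range, number of points and spline settings are made up; the generating formula is the one quoted above, with the Excel noise term approximated by uniform noise):
+
+```python
+import numpy as np
+from scipy.interpolate import UnivariateSpline
+
+rng = np.random.default_rng(0)
+x = np.linspace(0.0, 4.0, 60)
+y = np.exp(-x) * np.sin(np.pi * x) + rng.uniform(-0.001, 0.001, size=x.shape)
+
+spline = UnivariateSpline(x, y, s=1e-4)    # an ordinary regression fit...
+print(float(np.abs(spline(x) - y).max()))  # ...interpolates the points very well,
+# but its coefficients say nothing about the closed form exp(-x) * sin(pi * x);
+# recovering that expression automatically is the symbolic regression problem Eureqa tackles.
+```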
+
+Another noteworthy invited speaker was Jurgen Schmidhuber. He tried to convince us that we need to extend the reinforcement learning paradigm. The idea would be that instead of an agent only trying to optimize the amount of long-term reward he gets from a teacher, he would also try to collect “internal reward”. The internal reward is defined as follows: as the agent learns, he builds a better model of the world. Another way to look at this learning is that the agent simply learns how to “compress” his experience better. The reduction in representation size he gains from learning from a particular impression is what Jurgen calls the “internal reward”. In other words, it is the difference between the number of bits needed to represent your internal model before and after an impression.
+
+E.g. you listen to a new catchy song: Jurgen says that you think it’s catchy because you’ve never heard anything like it before; it is surprising and hence helps you learn a great deal. This in turn means you’ve just upgraded the “compression algorithm” in your brain, and the amount of improvement is now reflected in you experiencing “internal reward”. Listening to a song you’ve heard a million times before doesn’t improve the compression at all; hence, no internal reward.
+
+I really like this idea of internal reward, and as far as I understand it would be very easy to test. Unfortunately, I did not see any convincing experiments, so allow me to be sceptical ...
+
+The main conference was cool and I’ve met some interesting people working on things like probabilistic logic, a topic I desperately need to learn more about. Gjergji gave a talk about [our work on crowdsourcing](http://research.microsoft.com/apps/pubs/default.aspx?id=132648) (more details in a separate post). Some things I marked for looking into are:
+* Sebastian Riedel, Limin Yao and Andrew McCallum – “Modeling Relations and Their Mentions Without Labeled Text”: this paper is about how to improve information extraction methods which bootstrap from existing knowledge bases using constraint driven learning techniques.
+* Wannes Meert, Nima Taghipour & Hendrik Blockeel – “First Order Bayes Ball”: a paper on how to use the Bayes ball algorithm to figure out which nodes not to ground before running lifted inference.
+* Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Thibault Helleputte, Pierre Dupont – “Expectation Propagation for Bayesian Multi-task Feature Selection”: a paper on how to run EP for spike and slab models.
+* Edith Law, Burr Settles, and Tom Mitchell – “Learning to Tag from Open Vocabulary Labels”: a nice paper on how to deal with tags: they use topic models to do dimensionality reduction on free text tags and then use that in a maximum entropy predictor to tag new music.
+
+I enjoyed most of the industry day as well: I found it quite amusing that the Microsoft folks essentially gave away all of [Bing](http://www.bing.com/)’s secrets in one afternoon: Rakesh Agarwal mentioned the secret sauce behind Bing’s ranker (a neural net) whereas Thore Graepel explained the magic behind the advertisement selection mechanism (probit regression). Videos of these talks should be on videolectures.net soon.
+
+One last rant: the proceedings are published by Springer, who ask me to pay for them?!?! I’m still trying to figure out what value they added to the camera-ready copy we sent them a few months ago …