None
commit b7272b7b8060224e4fffa06414ff8f2694e597d2 1 parent 9d77285
@kyteague authored
8 blog/atom.xml
@@ -7,7 +7,7 @@
<link href="http://kyleteague.com/"/>
- <updated>2011-10-18T03:07:56Z</updated>
+ <updated>2011-10-18T10:14:37Z</updated>
<id>http://kyleteague.com/blog/atom.xml/</id>
@@ -27,12 +27,12 @@
<content type="html">
&lt;h2&gt;Introduction&lt;/h2&gt;
-&lt;p&gt;Earlier this year my colleagues and I participated in both tracks of the &lt;a href=&#34;http://kddcup.yahoo.com&#34;&gt;Yahoo! &lt;span class=&#34;caps&#34;&gt;KDD&lt;/span&gt; Cup&lt;/a&gt;. To give some background, Yahoo! released a dataset of around 250 million music ratings on genres, artists, albums, and tracks. Track 1 was concerned with predicting the ratings a user gave on a held-out test set (regression) and Track 2 was focused on predicting whether a user would love or hate a given track (binary&amp;nbsp;classification).&lt;/p&gt;
+&lt;p&gt;Earlier this year my colleagues and I participated in both tracks of the &lt;a href=&#34;http://kddcup.yahoo.com&#34;&gt;Yahoo! &lt;span class=&#34;caps&#34;&gt;KDD&lt;/span&gt; Cup&lt;/a&gt;. To give some background, the &lt;span class=&#34;caps&#34;&gt;KDD&lt;/span&gt; Cup is the biggest annual machine learning competition (at least unpaid). This year Yahoo! sponsored the event and released a dataset of around 250 million music ratings on genres, artists, albums, and tracks (&amp;#8220;items&amp;#8221;). There were two different tracks in the competition. Track 1 was concerned with predicting the ratings a user gave on a held-out test set (regression) and Track 2 was focused on predicting whether a user would love or hate a given item (binary&amp;nbsp;classification).&lt;/p&gt;
&lt;p&gt;I ended up leaving the company sponsoring my work early in the competition to join &lt;a href=&#34;http://getglue.com&#34;&gt;GetGlue&lt;/a&gt; as their data scientist, effectively ending my contributions code-wise. I was around #3 or #4 in Track 1 at the time with a blended matrix factorization algorithm. It looked at a few temporal and hierarchical features and was solved using a simple gradient descent algorithm with an adaptive learning rate. It was still good enough for 24th place overall. I don&amp;#8217;t know the exact number of teams, but I heard unofficially that it topped over 1,000. My colleague, &lt;a href=&#34;http://www.ee.ucla.edu/~jskong&#34;&gt;Joseph Kong&lt;/a&gt; stayed on and continued to work part time on Track 2 where we (mostly he) &lt;a href=&#34;http://kddcup.yahoo.com/leaderboard.php?track=2&amp;amp;n=100&#34;&gt;placed 9th&lt;/a&gt;. Unlike other competitors ahead of us though, we did not use an ensemble stack (other than stacking our own algorithm with parameter variations). The best single method was around 3.7%. Most of the top &lt;span class=&#34;caps&#34;&gt;MF&lt;/span&gt;-&lt;span class=&#34;caps&#34;&gt;BPR&lt;/span&gt; algorithms could only achieve about what we did on the Test1&amp;nbsp;set.&lt;/p&gt;
&lt;h2&gt;A Primer on Square&amp;nbsp;Counting&lt;/h2&gt;
-&lt;p&gt;We differed from most of the competitors as Joseph came up with a radical &amp;#8220;square counting&amp;#8221; algorithm which was inspired by triangle counting in graph mining.
+&lt;p&gt;We differed from most of the competitors due to Joseph coming up with a radical &amp;#8220;square counting&amp;#8221; algorithm. He was inspired by triangle counting in graph mining.
&lt;img class=&#34;cent&#34; src=&#34;/media/images/square_configs.png&#34; alt=&#34;Square Configurations&#34; /&gt;
-So let&amp;#8217;s say we want to see if you&amp;#8217;ll like song A. You and a friend both love song B, but you and another friend hate song C. The first friend loves A, but the second friend hates it. The situations with the two friends are modeled in the above figure, specifically in configurations 7 and 0. Now let&amp;#8217;s expand this to thousands of friends. We can actually count the configurations for each person/song pair we want to predict, which reduces each pair into a feature vector of length 2&lt;sup&gt;3&lt;/sup&gt;. The feature vectors can then be passed into your classic classification algorithm of choice. Of course, there are other subtleties like normalizing the counts based on the degrees of the nodes and other enhancements we made to get down to a 4.6% error rate, but the basic idea is pretty simple. There is also the problem of designing an efficient algorithm to actually do the square counting. Oh, and if you&amp;#8217;re wondering, our research shows that you&amp;#8217;ll most likely hate song A &amp;#8212; hate is a better&amp;nbsp;predictor.&lt;/p&gt;
+So let&amp;#8217;s say we want to see if you&amp;#8217;ll like song A. You and a friend both love song B and you and another friend hate song C. The first friend loves A, but the second friend hates it. The situations with the two friends are modeled in the above figure, specifically in configurations 7 and 0. Now let&amp;#8217;s expand this to thousands of friends. We can actually count the configurations for each person/song pair we want to predict reducing each pair into a feature vector of length 2&lt;sup&gt;3&lt;/sup&gt;. The feature vectors can then be passed into your classic classification algorithm of choice. Of course, there are other subtleties like normalizing the counts based on the degrees of the nodes and other enhancements we made to get down to a 4.6% error rate, but the basic idea is pretty simple. There is also the problem of designing an efficient algorithm to actually do the square counting. Oh, and if you&amp;#8217;re wondering, our research shows that you&amp;#8217;ll most likely hate song A &amp;#8212; hate is a better&amp;nbsp;predictor.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Square counting is an effective algorithm in a bipartite graph with a binary rating system. We have ideas on how to leverage square counting for regression (think soft-max) and other graph types, but this is simply the first iteration. It is essentially a feature extraction stage which transforms the problem from a sparse ratings matrix into a traditional classification problem. The nice part is you can easily parallelize it, then feed the output into any number of machine learning frameworks, which require far less processing time than would be required for a matrix factorization method. Using the &lt;a href=&#34;http://cran.r-project.org/web/packages/gbm&#34;&gt;gbm package&lt;/a&gt; in R worked just fine for our use&amp;nbsp;case.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://kddcup.yahoo.com/pdf/Track2-KKTsLearningMachine-Paper.pdf&#34;&gt;Read the paper here&lt;/a&gt; or view the&amp;nbsp;slides:&lt;/p&gt;
2  blog/excerpts.xml
@@ -7,7 +7,7 @@
<link href="http://kyleteague.com/"/>
- <updated>2011-10-18T03:07:56Z</updated>
+ <updated>2011-10-18T10:14:37Z</updated>
<id>http://kyleteague.com/blog/excerpts.xml/</id>
6 blog/top-10-kdd-cup.html
@@ -55,12 +55,12 @@ <h1 class="title">
<div class="post_body">
<h2>Introduction</h2>
-<p>Earlier this year my colleagues and I participated in both tracks of the <a href="http://kddcup.yahoo.com">Yahoo! <span class="caps">KDD</span> Cup</a>. To give some background, Yahoo! released a dataset of around 250 million music ratings on genres, artists, albums, and tracks. Track 1 was concerned with predicting the ratings a user gave on a held-out test set (regression) and Track 2 was focused on predicting whether a user would love or hate a given track (binary&nbsp;classification).</p>
+<p>Earlier this year my colleagues and I participated in both tracks of the <a href="http://kddcup.yahoo.com">Yahoo! <span class="caps">KDD</span> Cup</a>. To give some background, the <span class="caps">KDD</span> Cup is the biggest annual machine learning competition (at least unpaid). This year Yahoo! sponsored the event and released a dataset of around 250 million music ratings on genres, artists, albums, and tracks (&#8220;items&#8221;). There were two different tracks in the competition. Track 1 was concerned with predicting the ratings a user gave on a held-out test set (regression) and Track 2 was focused on predicting whether a user would love or hate a given item (binary&nbsp;classification).</p>
<p>I ended up leaving the company sponsoring my work early in the competition to join <a href="http://getglue.com">GetGlue</a> as their data scientist, effectively ending my contributions code-wise. I was around #3 or #4 in Track 1 at the time with a blended matrix factorization algorithm. It looked at a few temporal and hierarchical features and was solved using a simple gradient descent algorithm with an adaptive learning rate. It was still good enough for 24th place overall. I don&#8217;t know the exact number of teams, but I heard unofficially that it topped over 1,000. My colleague, <a href="http://www.ee.ucla.edu/~jskong">Joseph Kong</a> stayed on and continued to work part time on Track 2 where we (mostly he) <a href="http://kddcup.yahoo.com/leaderboard.php?track=2&amp;n=100">placed 9th</a>. Unlike other competitors ahead of us though, we did not use an ensemble stack (other than stacking our own algorithm with parameter variations). The best single method was around 3.7%. Most of the top <span class="caps">MF</span>-<span class="caps">BPR</span> algorithms could only achieve about what we did on the Test1&nbsp;set.</p>
<h2>A Primer on Square&nbsp;Counting</h2>
-<p>We differed from most of the competitors as Joseph came up with a radical &#8220;square counting&#8221; algorithm which was inspired by triangle counting in graph mining.
+<p>We differed from most of the competitors due to Joseph coming up with a radical &#8220;square counting&#8221; algorithm. He was inspired by triangle counting in graph mining.
<img class="cent" src="/media/images/square_configs.png" alt="Square Configurations" />
-So let&#8217;s say we want to see if you&#8217;ll like song A. You and a friend both love song B, but you and another friend hate song C. The first friend loves A, but the second friend hates it. The situations with the two friends are modeled in the above figure, specifically in configurations 7 and 0. Now let&#8217;s expand this to thousands of friends. We can actually count the configurations for each person/song pair we want to predict, which reduces each pair into a feature vector of length 2<sup>3</sup>. The feature vectors can then be passed into your classic classification algorithm of choice. Of course, there are other subtleties like normalizing the counts based on the degrees of the nodes and other enhancements we made to get down to a 4.6% error rate, but the basic idea is pretty simple. There is also the problem of designing an efficient algorithm to actually do the square counting. Oh, and if you&#8217;re wondering, our research shows that you&#8217;ll most likely hate song A &#8212; hate is a better&nbsp;predictor.</p>
+So let&#8217;s say we want to see if you&#8217;ll like song A. You and a friend both love song B and you and another friend hate song C. The first friend loves A, but the second friend hates it. The situations with the two friends are modeled in the above figure, specifically in configurations 7 and 0. Now let&#8217;s expand this to thousands of friends. We can actually count the configurations for each person/song pair we want to predict reducing each pair into a feature vector of length 2<sup>3</sup>. The feature vectors can then be passed into your classic classification algorithm of choice. Of course, there are other subtleties like normalizing the counts based on the degrees of the nodes and other enhancements we made to get down to a 4.6% error rate, but the basic idea is pretty simple. There is also the problem of designing an efficient algorithm to actually do the square counting. Oh, and if you&#8217;re wondering, our research shows that you&#8217;ll most likely hate song A &#8212; hate is a better&nbsp;predictor.</p>
<h2>Conclusion</h2>
<p>Square counting is an effective algorithm in a bipartite graph with a binary rating system. We have ideas on how to leverage square counting for regression (think soft-max) and other graph types, but this is simply the first iteration. It is essentially a feature extraction stage which transforms the problem from a sparse ratings matrix into a traditional classification problem. The nice part is you can easily parallelize it, then feed the output into any number of machine learning frameworks, which require far less processing time than would be required for a matrix factorization method. Using the <a href="http://cran.r-project.org/web/packages/gbm">gbm package</a> in R worked just fine for our use&nbsp;case.</p>
<p><a href="http://kddcup.yahoo.com/pdf/Track2-KKTsLearningMachine-Paper.pdf">Read the paper here</a> or view the&nbsp;slides:</p>
6 index.html
@@ -56,12 +56,12 @@ <h1 class="title">
<div class="post_body">
<h2>Introduction</h2>
-<p>Earlier this year my colleagues and I participated in both tracks of the <a href="http://kddcup.yahoo.com">Yahoo! <span class="caps">KDD</span> Cup</a>. To give some background, Yahoo! released a dataset of around 250 million music ratings on genres, artists, albums, and tracks. Track 1 was concerned with predicting the ratings a user gave on a held-out test set (regression) and Track 2 was focused on predicting whether a user would love or hate a given track (binary&nbsp;classification).</p>
+<p>Earlier this year my colleagues and I participated in both tracks of the <a href="http://kddcup.yahoo.com">Yahoo! <span class="caps">KDD</span> Cup</a>. To give some background, the <span class="caps">KDD</span> Cup is the biggest annual machine learning competition (at least unpaid). This year Yahoo! sponsored the event and released a dataset of around 250 million music ratings on genres, artists, albums, and tracks (&#8220;items&#8221;). There were two different tracks in the competition. Track 1 was concerned with predicting the ratings a user gave on a held-out test set (regression) and Track 2 was focused on predicting whether a user would love or hate a given item (binary&nbsp;classification).</p>
<p>I ended up leaving the company sponsoring my work early in the competition to join <a href="http://getglue.com">GetGlue</a> as their data scientist, effectively ending my contributions code-wise. I was around #3 or #4 in Track 1 at the time with a blended matrix factorization algorithm. It looked at a few temporal and hierarchical features and was solved using a simple gradient descent algorithm with an adaptive learning rate. It was still good enough for 24th place overall. I don&#8217;t know the exact number of teams, but I heard unofficially that it topped over 1,000. My colleague, <a href="http://www.ee.ucla.edu/~jskong">Joseph Kong</a> stayed on and continued to work part time on Track 2 where we (mostly he) <a href="http://kddcup.yahoo.com/leaderboard.php?track=2&amp;n=100">placed 9th</a>. Unlike other competitors ahead of us though, we did not use an ensemble stack (other than stacking our own algorithm with parameter variations). The best single method was around 3.7%. Most of the top <span class="caps">MF</span>-<span class="caps">BPR</span> algorithms could only achieve about what we did on the Test1&nbsp;set.</p>
<h2>A Primer on Square&nbsp;Counting</h2>
-<p>We differed from most of the competitors as Joseph came up with a radical &#8220;square counting&#8221; algorithm which was inspired by triangle counting in graph mining.
+<p>We differed from most of the competitors due to Joseph coming up with a radical &#8220;square counting&#8221; algorithm. He was inspired by triangle counting in graph mining.
<img class="cent" src="/media/images/square_configs.png" alt="Square Configurations" />
-So let&#8217;s say we want to see if you&#8217;ll like song A. You and a friend both love song B, but you and another friend hate song C. The first friend loves A, but the second friend hates it. The situations with the two friends are modeled in the above figure, specifically in configurations 7 and 0. Now let&#8217;s expand this to thousands of friends. We can actually count the configurations for each person/song pair we want to predict, which reduces each pair into a feature vector of length 2<sup>3</sup>. The feature vectors can then be passed into your classic classification algorithm of choice. Of course, there are other subtleties like normalizing the counts based on the degrees of the nodes and other enhancements we made to get down to a 4.6% error rate, but the basic idea is pretty simple. There is also the problem of designing an efficient algorithm to actually do the square counting. Oh, and if you&#8217;re wondering, our research shows that you&#8217;ll most likely hate song A &#8212; hate is a better&nbsp;predictor.</p>
+So let&#8217;s say we want to see if you&#8217;ll like song A. You and a friend both love song B and you and another friend hate song C. The first friend loves A, but the second friend hates it. The situations with the two friends are modeled in the above figure, specifically in configurations 7 and 0. Now let&#8217;s expand this to thousands of friends. We can actually count the configurations for each person/song pair we want to predict reducing each pair into a feature vector of length 2<sup>3</sup>. The feature vectors can then be passed into your classic classification algorithm of choice. Of course, there are other subtleties like normalizing the counts based on the degrees of the nodes and other enhancements we made to get down to a 4.6% error rate, but the basic idea is pretty simple. There is also the problem of designing an efficient algorithm to actually do the square counting. Oh, and if you&#8217;re wondering, our research shows that you&#8217;ll most likely hate song A &#8212; hate is a better&nbsp;predictor.</p>
<h2>Conclusion</h2>
<p>Square counting is an effective algorithm in a bipartite graph with a binary rating system. We have ideas on how to leverage square counting for regression (think soft-max) and other graph types, but this is simply the first iteration. It is essentially a feature extraction stage which transforms the problem from a sparse ratings matrix into a traditional classification problem. The nice part is you can easily parallelize it, then feed the output into any number of machine learning frameworks, which require far less processing time than would be required for a matrix factorization method. Using the <a href="http://cran.r-project.org/web/packages/gbm">gbm package</a> in R worked just fine for our use&nbsp;case.</p>
<p><a href="http://kddcup.yahoo.com/pdf/Track2-KKTsLearningMachine-Paper.pdf">Read the paper here</a> or view the&nbsp;slides:</p>