# An Alternative View: When Does SGD Escape Local Minima?

Robert Kleinberg, Yuanzhi Li, Yang Yuan. [An Alternative View: When Does SGD Escape Local Minima?](https://arxiv.org/pdf/1802.06175.pdf) ICML 2018.

## tl;dr
- Tackling the question "Why does deep learning work?", the authors ask whether stochastic gradient descent (SGD) escapes local minima. The answer is yes for convex functions, and usually yes for non-convex ones.
- "Usually yes" means the noise-weighted average of the gradients over a point's neighborhood must be one point convex with respect to the desired solution x-star.
- Empirically, the authors show that neural network loss surfaces exhibit one point convexity locally.

## One point convexity
Informally, with step size n, gradient noise W(x), and update y = x - n\*grad(f(x)), a function f is c-point convex with respect to a fixed point x-star if the inner product of the negative gradient of the expected (noise-smoothed) value of f over the neighborhood of y with the direction (x-star - y) is at least c times the squared 2-norm of (x-star - y). When this holds, y converges to x-star with decent probability.
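To make the condition concrete, here is a minimal numerical check of the c-point-convexity inequality on a toy function. Everything here is my own illustration, not the paper's: the choice f(x) = x^2 with x-star = 0, the grid of test points, and the helper name `one_point_convex`, with the plain gradient standing in for the noise-smoothed one.

```python
def grad_f(x):
    return 2.0 * x  # gradient of the toy function f(x) = x^2

def one_point_convex(grad, x_star, points, c):
    """Check <-grad(x), x_star - x> >= c * ||x_star - x||^2 at each point."""
    for x in points:
        d = x_star - x
        if -grad(x) * d < c * d * d:
            return False
    return True

# f(x) = x^2 satisfies the inequality with c = 2, so c = 1 certainly holds
xs = [i / 10.0 for i in range(-50, 51)]
print(one_point_convex(grad_f, 0.0, xs, c=1.0))  # prints True
```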

## Motivating example
The authors use the motivating example of a parabola with added spiky noise. Since flat local minima are conjectured to generalize better, the spiky noise represents the most extreme (sharpest) local minima for SGD to overcome.
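This picture can be simulated in a few lines. The sketch below is my own, not the paper's experiment: the spike amplitude and frequency, step size, noise scale, and iteration count are all illustrative choices. The point it demonstrates is that gradient noise lets the iterate skip over the sharp spikes and settle near the parabola's broad minimum.

```python
import math
import random

def spiky_grad(x):
    # f(x) = x^2 + 0.1*sin(30x), so f'(x) = 2x + 3*cos(30x);
    # the cosine term carves many sharp local minima into the parabola
    return 2.0 * x + 3.0 * math.cos(30.0 * x)

random.seed(0)
x = 1.5          # start away from the broad minimum at 0
eta = 0.05       # step size
for _ in range(2000):
    noise = random.gauss(0.0, 1.0)      # SGD gradient noise
    x -= eta * (spiky_grad(x) + noise)

# despite the spikes, the noisy iterate drifts toward the broad minimum
print(abs(x) < 1.0)
```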

## Main theorem
The main theorem shows that the iterates y converge to the fixed point x-star and, once there, stay there. The main assumption is L-smoothness.
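L-smoothness just means the gradient is L-Lipschitz: |grad f(x) - grad f(y)| <= L \* |x - y| for all x, y. A small numeric sanity check on a toy function (my own illustration; the helper `is_l_smooth` is not from the paper):

```python
def df(x):
    return 2.0 * x  # gradient of f(x) = x^2, which is exactly 2-Lipschitz

def is_l_smooth(grad, points, L):
    """Check |grad(a) - grad(b)| <= L * |a - b| over all pairs of points."""
    return all(abs(grad(a) - grad(b)) <= L * abs(a - b)
               for a in points for b in points)

xs = [i / 5.0 for i in range(-10, 11)]
print(is_l_smooth(df, xs, L=2.0))  # prints True
print(is_l_smooth(df, xs, L=1.0))  # prints False: L = 1 is too small
```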
# Learning to Ask Good Questions

Sudha Rao, Hal Daume. [Learning to Ask Good Questions: Ranking Clarification Questions using Neural Expected Value of Perfect Information.](https://arxiv.org/pdf/1805.04655.pdf) ACL 2018.

## tl;dr
- As the title suggests, the authors rank candidate clarification questions for a post so that the asker gets more targeted help.
- Since I did an NLP final project similar to this work, I am particularly intrigued.
- The neural network architecture is novel, and the real challenge seems to be dataset generation.

## Test Time Pipeline
1. Given a post p, retrieve the 10 most similar posts in the training set using Lucene.
2. The questions asked on those 10 similar posts become the candidate question set Q = {q_i}; the edits made to each post in response to its question become the candidate answer set A = {a_j}.
3. For each candidate question q_i, generate an answer representation F(p, q_i) and calculate how close each answer candidate a_j is to F(p, q_i).
4. Calculate the utility gained by post p if it were updated with answer a_j.
5. Given the expected utilities, rank the candidate questions by expected utility.
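The five steps above can be sketched end-to-end with toy stand-ins. Everything in this snippet is hypothetical scaffolding, not the authors' system: plain vectors replace Lucene's retrieval, a simple vector average stands in for the learned F(p, q_i), and cosine similarity to the post stands in for the learned utility function.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_questions(post_vec, candidates):
    """candidates: (question_text, question_vec, answer_vec) triples.
    Scores each question by a toy expected utility and sorts best-first."""
    scored = []
    for question, q_vec, a_vec in candidates:
        # stand-in for the learned answer representation F(p, q_i)
        f_vec = [(p + q) / 2.0 for p, q in zip(post_vec, q_vec)]
        dist = math.sqrt(sum((f - a) ** 2 for f, a in zip(f_vec, a_vec)))
        prob = math.exp(-dist)             # P(a_j | p, q_i): negative exponential of distance
        utility = cosine(post_vec, a_vec)  # stand-in for the utility of the update
        scored.append((prob * utility, question))
    return [q for _, q in sorted(scored, reverse=True)]

post = [1.0, 0.0]
candidates = [("off-topic question", [0.0, 1.0], [0.0, 1.0]),
              ("on-topic question", [1.0, 0.0], [1.0, 0.0])]
print(rank_questions(post, candidates))  # on-topic question ranks first
```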

## Loss functions
From the pipeline, we see that many of the steps involve quantifying "better" questions, answers, and utility. Without getting too mathematical (mostly because I haven't figured out how to implement MathJax on my GitHub Pages), we describe the functions.

1. Lucene uses a variant of TF-IDF to find related documents.
2. No math, just aggregation.
3. The answer representation F(p, q_i) comes from a neural model. Answer distance is the cosine similarity between F(p, q_i) and the average word vector of answer a_j.
4. Expected utility depends on the probability of answer candidate a_j being the answer to question q_i and the utility value of adding that information. We model the probability P(a_j | p, q_i) as a negative exponential of the distance between F(p, q_i) and a-hat_j, the average word vector of a_j. The utility function is the sigmoid of F_util, where F_util is also a neural network.
5. We then sort by expected utilities.
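Step 4 can be made concrete with a toy calculation. The distances and F_util scores below are made-up numbers purely for illustration; only the functional forms (negative exponential for the answer probability, sigmoid for the utility) come from the paper's description.

```python
import math

def answer_prob(dist):
    # P(a_j | p, q_i) modeled as a negative exponential of the distance
    # between F(p, q_i) and a-hat_j (the answer's average word vector)
    return math.exp(-dist)

def utility(f_util_score):
    # utility is the sigmoid of the F_util network's output
    return 1.0 / (1.0 + math.exp(-f_util_score))

dists = [0.2, 1.0, 2.5]    # hypothetical distances to each a-hat_j
scores = [1.5, 0.0, -1.0]  # hypothetical F_util outputs
eu = sum(answer_prob(d) * utility(s) for d, s in zip(dists, scores))
print(round(eu, 3))  # expected utility of this candidate question: 0.875
```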

The neural networks mentioned form one joint model: a question LSTM and an answer LSTM, trained with a loss function based on the existing (post, question, answer) triples.

## Evaluation
Because of the tricky nature of the experimental setup, expert annotators are vital. Evaluations are conducted with expert annotations, against the original question, and excluding the original question.

## Next steps
For my personal research, it might be useful to think about health question answering and how to automate a health knowledge graph. Isn't this classification model another type of HKG?