# An Alternative View: When Does SGD Escape Local Minima?

Robert Kleinberg, Yuanzhi Li, Yang Yuan. [An Alternative View: When Does SGD Escape Local Minima?](https://arxiv.org/pdf/1802.06175.pdf) ICML 2018.

## tl;dr
- Tackling the question "Why does deep learning work?", the authors ask whether stochastic gradient descent (SGD) escapes local minima. The answer is yes for convex functions, and usually yes for non-convex ones.
- "Usually yes" means the noise-weighted average of the gradients over a point's neighborhood must be one point convex with respect to the desired solution x-star.
- Empirically, the authors show that neural network loss surfaces exhibit one point convexity locally.

## One point convexity
Informally, with step size n, gradient noise W(x), and update y = x - n\*grad(f(x)), a function f is c-point convex with respect to a fixed point x-star if the inner product of the negative gradient of the expected (noise-smoothed) value of f over the neighborhood of y with the direction (x-star - y) is at least c times the squared 2-norm of (x-star - y). When this holds, y converges to x-star with decent probability.
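To make the condition concrete, here is a minimal numerical check of the c-point-convexity inequality on a toy function. Everything here is my own illustration, not the paper's: the choice f(x) = x^2 with x-star = 0, the grid of test points, and the helper name `one_point_convex`, with the plain gradient standing in for the noise-smoothed one.

```python
def grad_f(x):
    return 2.0 * x  # gradient of the toy function f(x) = x^2

def one_point_convex(grad, x_star, points, c):
    """Check <-grad(x), x_star - x> >= c * ||x_star - x||^2 at each point."""
    for x in points:
        d = x_star - x
        if -grad(x) * d < c * d * d:
            return False
    return True

# f(x) = x^2 satisfies the inequality with c = 2, so c = 1 certainly holds
xs = [i / 10.0 for i in range(-50, 51)]
print(one_point_convex(grad_f, 0.0, xs, c=1.0))  # prints True
```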

## Motivating example
The authors use the motivating example of a parabola with added spiky noise. Since flat local minima are conjectured to generalize better, the spiky noise represents the most extreme (sharpest) local minima for SGD to overcome.
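This picture can be simulated in a few lines. The sketch below is my own, not the paper's experiment: the spike amplitude and frequency, step size, noise scale, and iteration count are all illustrative choices. The point it demonstrates is that gradient noise lets the iterate skip over the sharp spikes and settle near the parabola's broad minimum.

```python
import math
import random

def spiky_grad(x):
    # f(x) = x^2 + 0.1*sin(30x), so f'(x) = 2x + 3*cos(30x);
    # the cosine term carves many sharp local minima into the parabola
    return 2.0 * x + 3.0 * math.cos(30.0 * x)

random.seed(0)
x = 1.5          # start away from the broad minimum at 0
eta = 0.05       # step size
for _ in range(2000):
    noise = random.gauss(0.0, 1.0)      # SGD gradient noise
    x -= eta * (spiky_grad(x) + noise)

# despite the spikes, the noisy iterate drifts toward the broad minimum
print(abs(x) < 1.0)
```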

## Main theorem
The main theorem shows that the iterates y converge to the fixed point x-star and, once there, stay there. The main assumption is L-smoothness.
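L-smoothness just means the gradient is L-Lipschitz: |grad f(x) - grad f(y)| <= L \* |x - y| for all x, y. A small numeric sanity check on a toy function (my own illustration; the helper `is_l_smooth` is not from the paper):

```python
def df(x):
    return 2.0 * x  # gradient of f(x) = x^2, which is exactly 2-Lipschitz

def is_l_smooth(grad, points, L):
    """Check |grad(a) - grad(b)| <= L * |a - b| over all pairs of points."""
    return all(abs(grad(a) - grad(b)) <= L * abs(a - b)
               for a in points for b in points)

xs = [i / 5.0 for i in range(-10, 11)]
print(is_l_smooth(df, xs, L=2.0))  # prints True
print(is_l_smooth(df, xs, L=1.0))  # prints False: L = 1 is too small
```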
# Learning to Ask Good Questions

Sudha Rao, Hal Daume. [Learning to Ask Good Questions: Ranking Clarification Questions using Neural Expected Value of Perfect Information.](https://arxiv.org/pdf/1805.04655.pdf) ACL 2018.

## tl;dr
- As the title suggests, the authors rank candidate clarification questions for a post so that the asker gets more targeted help.
- Since I did an NLP final project similar to this work, I am particularly intrigued.
- The neural network architecture is novel, and the real challenge seems to be dataset generation.

## Test Time Pipeline
1. Given a post p, retrieve the 10 most similar posts in the training set using Lucene.
2. The questions asked on those 10 similar posts become the candidate question set Q = {q_i}; the edits made to each post in response to its question become the candidate answer set A = {a_j}.
3. For each candidate question q_i, generate an answer representation F(p, q_i) and calculate how close each answer candidate a_j is to F(p, q_i).
4. Calculate the utility gained by post p if it were updated with answer a_j.
5. Given the expected utilities, rank the candidate questions by expected utility.
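The five steps above can be sketched end-to-end with toy stand-ins. Everything in this snippet is hypothetical scaffolding, not the authors' system: plain vectors replace Lucene's retrieval, a simple vector average stands in for the learned F(p, q_i), and cosine similarity to the post stands in for the learned utility function.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_questions(post_vec, candidates):
    """candidates: (question_text, question_vec, answer_vec) triples.
    Scores each question by a toy expected utility and sorts best-first."""
    scored = []
    for question, q_vec, a_vec in candidates:
        # stand-in for the learned answer representation F(p, q_i)
        f_vec = [(p + q) / 2.0 for p, q in zip(post_vec, q_vec)]
        dist = math.sqrt(sum((f - a) ** 2 for f, a in zip(f_vec, a_vec)))
        prob = math.exp(-dist)             # P(a_j | p, q_i): negative exponential of distance
        utility = cosine(post_vec, a_vec)  # stand-in for the utility of the update
        scored.append((prob * utility, question))
    return [q for _, q in sorted(scored, reverse=True)]

post = [1.0, 0.0]
candidates = [("off-topic question", [0.0, 1.0], [0.0, 1.0]),
              ("on-topic question", [1.0, 0.0], [1.0, 0.0])]
print(rank_questions(post, candidates))  # on-topic question ranks first
```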

## Loss functions
From the pipeline, we see that many of the steps involve quantifying "better" questions, answers, and utility. Without getting too mathematical (mostly because I haven't figured out how to implement MathJax on my GitHub Pages), we describe the functions.

1. Lucene uses a variant of TF-IDF to find related documents.
2. No math, just aggregation.
3. The answer representation F(p, q_i) comes from a neural model. Answer distance is the cosine similarity between F(p, q_i) and the average word vector of answer a_j.
4. Expected utility depends on the probability of answer candidate a_j being the answer to question q_i and the utility value of adding that information. We model the probability P(a_j | p, q_i) as a negative exponential of the distance between F(p, q_i) and a-hat_j, the average word vector of a_j. The utility function is the sigmoid of F_util, where F_util is also a neural network.
5. We then sort by expected utilities.
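Step 4 can be made concrete with a toy calculation. The distances and F_util scores below are made-up numbers purely for illustration; only the functional forms (negative exponential for the answer probability, sigmoid for the utility) come from the paper's description.

```python
import math

def answer_prob(dist):
    # P(a_j | p, q_i) modeled as a negative exponential of the distance
    # between F(p, q_i) and a-hat_j (the answer's average word vector)
    return math.exp(-dist)

def utility(f_util_score):
    # utility is the sigmoid of the F_util network's output
    return 1.0 / (1.0 + math.exp(-f_util_score))

dists = [0.2, 1.0, 2.5]    # hypothetical distances to each a-hat_j
scores = [1.5, 0.0, -1.0]  # hypothetical F_util outputs
eu = sum(answer_prob(d) * utility(s) for d, s in zip(dists, scores))
print(round(eu, 3))  # expected utility of this candidate question: 0.875
```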

The neural networks mentioned form one joint model: a question LSTM and an answer LSTM, trained with a loss function based on the existing (post, question, answer) triples.

## Evaluation
Because of the tricky nature of the experimental setup, expert annotators are vital. Evaluations are conducted with expert annotations, against the original question, and excluding the original question.

## Next steps
For my personal research, it might be useful to think about health question answering and how to automate a health knowledge graph. Isn't this classification model another type of HKG?