
% !TEX root = altosaar-2020-thesis.tex
\chapter{Discussion}
\label{ch:discussion}
\lettrine[image=true,lines=3]{design/P}{robabilistic} modeling is useful across scientific domains. However, probabilistic modeling methods that do not take into account the structure of a problem, the form of individual datapoints, or information about probability distributions during optimization leave performance gains on the table.

As a motivating example, we built the structure of a statistical physics model into a probabilistic modeling method with \acrlongpl{hvm}. Efficient use of the connectivity patterns in physics models enabled scaling variational approximations to models with millions of random variables.

There is also utility in constructing probabilistic models with knowledge about individual datapoints. \acrlong{rfs} outperforms competitive recommendation models that fail to take into account either the goals of recommendation or the structure of items with sets of attributes.

We also improved variational inference by making use of information about probability distributions within the \acrlong{PVI} algorithm. This enabled accurate inferences about probability distributions.

To further unify the thesis that problem structure is useful in probabilistic modeling, we test \acrlong{PVI} to measure whether the benefits of leveraging knowledge about probability distributions are additive to the performance gains from developing applied methods.

\input{table/tab_pvi_hvm}
Consider an Ising model studied in \Cref{ch:hvm}, where the goal is accurate inference of the free energy. \Cref{tab:pvi-hvm} shows a comparison between \gls{vi} and \gls{PVI} in an \gls{hvm}. This is a result of testing the best-performing settings from \Cref{ch:pvi} with the entropy constraint on both the variational prior and recursive variational approximation in an \gls{hvm}. The additional information \gls{PVI} makes available to the variational approximation during optimization leads to more accurate inference of the free energy.
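
For reference, the sense in which a variational approximation yields an estimate of the free energy can be sketched with the standard variational bound. The notation here is assumed for illustration and may differ from that of \Cref{ch:hvm}: $E(\mathbf{z})$ denotes the energy of a configuration $\mathbf{z}$, $\beta > 0$ the inverse temperature, $Z = \sum_{\mathbf{z}} \exp(-\beta E(\mathbf{z}))$ the partition function, and $q$ the variational approximation with entropy $H(q) = -\mathrm{E}_{q}[\log q(\mathbf{z})]$. Then
\[
  F = -\frac{1}{\beta} \log Z \;\le\; \mathrm{E}_{q}\left[E(\mathbf{z})\right] - \frac{1}{\beta} H(q),
\]
with equality when $q$ is the Boltzmann distribution $q(\mathbf{z}) \propto \exp(-\beta E(\mathbf{z}))$. Under this reading, an optimization procedure that reaches a lower value of the right-hand side gives a tighter, and in that sense more accurate, estimate of the free energy.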

Further, \gls{PVI} can be applied to probability models fit with maximum likelihood estimation. \Cref{tab:pvi-rfs} reports the performance of a \gls{rfs} model from \Cref{sec:rfs-experiments} fit to arXiv user behavior data. Fitting the recommendation model using the \gls{PVI} entropy proximity constraint improves top-10 recommendation recall. Metrics other than out-matrix recall (e.g. in-matrix recall) were comparable between these methods. With the \gls{PVI} entropy constraint, the recommendation performance of \gls{rfs} also improved in the meal recommendation task. \Cref{tab:pvi-rfs-meals} reports these results. The best-performing settings from \Cref{ch:pvi} generalize to maximum likelihood estimation in recommender systems, here giving a $6.9\%$ boost in top-$1$ recommender recall.

That \gls{PVI} yielded improvements for both \glspl{hvm} applied to statistical physics problems and the \gls{rfs} recommendation model highlights several directions for further research. First, might \gls{PVI} yield further gains in accuracy when applied to statistical physics models with millions of random variables? Practitioners are willing to trade accuracy for scale in some cases; \gls{PVI} is straightforward to test in new probability models and might help reduce the need for such trade-offs.
\input{table/tab_pvi_rfs}

Studying where \gls{PVI} yields only marginal gains is also worth considering. For example, the entropy proximity constraint yielded smaller improvements when applied to \gls{rfs} fit to the meal recommendation data in \Cref{sec:rfs-experiments}. This may be because the large size of the dataset helped prevent overfitting, reducing the benefit of constraining parameter updates. In contrast, \glspl{hvm} fit to statistical physics models in \Cref{sec:hvm-experiments} converged to a solution very quickly, so monitoring convergence rates may be an additional source of information for proximity statistics.
\input{table/tab_pvi_rfs_meals}

While \Cref{ch:hvm} studied classical statistical physics models, future work in computational materials science and computational drug discovery will need to incorporate or approximate quantum effects. Density functional theory calculations based on quantum mechanics are expensive~\citep{schmidt2019recent} and limit the length of time that a material or drug binding to a protein can be simulated. Future work in this area should include study of the trade-off between the size of a system and the accuracy needed to study the behavior of a system to achieve a materials design or drug design goal. For example, suppose the behavior of a drug binding to a protein over the course of several seconds is of clinical interest. Then a practitioner might tolerate more inaccuracy in an \gls{hvm} approximation than they would if the short-run behavior could be accurately captured in a density functional theory calculation. One way of improving the trade-off may be to reduce the cost of fitting \glspl{hvm} by deriving objective functions with a better gradient signal-to-noise ratio~\citep{tucker2018doubly,rainforth2018tighter}. A similar trade-off occurs for system size, and it is unclear where \glspl{hvm} may provide the only way to model a large-scale physical system.

\Cref{ch:rfs} developed \gls{rfs}, and there remain several directions for future work on recommendation models for items with sets of attributes. Probabilistic generative models for use in recommendation may enable better recommendations under uncertainty, or easier incorporation of prior knowledge. However, probability distributions of sets of attributes are difficult to parameterize. One example of a distribution defined on sets is the Wallenius distribution~\citep{wallenius1963biased,junqu2000wallenius}. It is interesting to consider how a distribution on sets might be parameterized using a permutation-invariant model such as \gls{rfs}~\citep{bloem-reddy2019probabilistic,lee2018set-transformer}. Further, generalization bounds are necessary follow-up work to universal approximation properties. A model may be able to represent a distribution, but for practical purposes a key desideratum is finding functions, nonlinearities, and architectures that make optimization easy and generalization feasible~\citep{dziugaite2017computing}.

Another line of work is in developing robust negative sampling-based objective functions. The numeric value of the negative log-likelihood objective function used in \gls{rfs} or other models that use negative samples and embeddings cannot be used to reliably assess convergence. This is due to embeddings that are used in both positive and negative examples, leading stochastic gradient updates to both increase and decrease Monte Carlo estimates of the objective during optimization. Reliable methods to estimate the value of objective functions may help reduce the need for expensive recommender systems evaluation metrics where a model may need to be evaluated on every item in an evaluation set. While \gls{rfs} was designed for the recall evaluation metric, connecting binary classification objective functions with negative examples to ranking-based metrics such as normalized discounted cumulative gain would make these models more broadly useful.
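
To illustrate why the numeric value of such an objective is a noisy signal of convergence, consider one common form of a negative sampling objective. The notation is assumed here for illustration and the exact objective in \Cref{ch:rfs} may differ: $u$ is a user, $x^{+}$ an item the user consumed, $x^{-}_{1}, \ldots, x^{-}_{S}$ are $S$ sampled negative items, $f_{\theta}$ is the model score, and $\sigma$ is the logistic function. A single-example Monte Carlo estimate of the objective is then
\[
  \hat{\mathcal{L}}(\theta) = -\log \sigma\left(f_{\theta}(u, x^{+})\right) - \sum_{s=1}^{S} \log\left(1 - \sigma\left(f_{\theta}(u, x^{-}_{s})\right)\right).
\]
An item that appears as $x^{+}$ for one user can be drawn as $x^{-}_{s}$ for another, so the same embedding enters both terms across minibatches. A gradient step that lowers the first term for one example can therefore raise the second term for a different example, and the value of $\hat{\mathcal{L}}$ also varies with the particular negatives drawn.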

In \Cref{ch:rfs}, we found that \gls{rfs} outperforms \gls{lstm} recurrent neural networks in the task of recommending arXiv documents to users. This is counterintuitive, as the order of item attributes (words in abstracts) should carry significant information. However, the computational budget was fixed for both models, and it is unclear which recommendation model to use with a large computational budget. Models such as transformers~\citep{vaswani2017attention,devlin2019bert:,lee2018set-transformer} might lead to improved recommendation performance, but at a greater computational cost than models such as \gls{rfs} with inner product parameterizations. Analyzing these trade-offs will help make informed choices of computational budget given performance requirements in practice. Under computational constraints due to monetary budget or privacy regulation, such as in clinical settings~\citep{huang2019clinicalbert:}, models such as \gls{rfs} that make fast, reasonably accurate predictions may be preferable to more accurate, slower models.

Through careful consideration of how to build problem structure into probabilistic models, we were able to scale variational methods to statistical physics models with millions of random variables, fit recommender systems to tens of millions of datapoints, and improve the accuracy of variational inference. This highlights the need to ensure that progress in probabilistic modeling continues to be translated into progress in applied domains such as statistical physics and recommender systems.