# <center> Future Financial Planning Tools for Consumers </center>

## <center> Ignace Decocq </center>
   

###   <p style="text-align: center;"> August 16, 2021 </p>

## Introduction

The financial decisions that consumers have to make over their lifetime are becoming increasingly complex. A good example is the shift from defined benefit to defined contribution plans, in which consumers take on greater individual responsibility and risk. The growing complexity of financial products is challenging for consumers with low financial knowledge and limited numeracy skills {cite}`bi2017financial`. Combined with uncertainty about the future, this forces consumers to be more aware of their financial well-being than ever before. Looking at the past, Poterba et al. {cite}`poterba2011were` examined retirement preparedness for the Children of Depression, War Baby, and Early Baby Boomer cohorts of the Health and Retirement Study and for the Asset and Health Dynamics Among the Oldest Old cohorts. They found that 46.1 percent die with less than 10,000 dollars. With this level of assets, they would not have been able to pay for unexpected events, and one might wonder whether such asset levels are adequate for retirement. Furthermore, saving behavior has not kept pace with increasing life expectancy, and the expected lifespan of the coming generations is unprecedented {cite}`hershfield2011future`. All these elements make it painfully clear that understanding one's financial situation has become one of the greatest challenges in life.

To combat these difficulties, consumers need additional support in planning for their future prosperity. One approach is the use of financial planning tools. These tools allow the consumer to carry out complex intertemporal calculations {cite}`bi2020limitations`. They also improve financial behavior, increase household wealth accumulation, and complement other planning aids such as a financial advisor {cite}`bi2017financial`. Although financial planning tools can greatly benefit consumers, they can also be a double-edged sword. More specifically, when consumers are misinformed about the capabilities of a tool, or when the design of the tool is inadequate, the consumer can be given sub-optimal or even misleading advice {cite}`dorman2018efficacy`. Design deficiencies arise when not all essential input variables are included, when not all risks are considered, and when accuracy is sacrificed for ease of use {cite}`bi2020limitations`. On top of that, results vary widely across tools because of the different methodologies and assumptions used in the models {cite}`dorman2018efficacy`. For example, assumptions about inflation and the use of different financial products have a large impact on the results. On the consumer side, the possibility of misunderstanding the implications of the results due to a lack of financial knowledge is a matter of great concern to financial educators {cite}`bi2020limitations`. Clarifying the results is therefore an essential part of making models operational. To address these deficiencies, Dorman et al. {cite}`dorman2018efficacy` found that accuracy improves when the models handle additional theoretical variables. They also found that consumers require tailored solutions that better capture their financial situation, which means that planning tools need to be more flexible: they should be able to operate in different financial settings and to show the impact of changes in input variables. To address the variability in results and the adaptability of models to different settings, this paper looks at reinforcement learning techniques in an intertemporal setting. Reinforcement learning increases the flexibility of the model while keeping fundamental theoretical aspects such as Optimal Control Theory at its core.

In the remainder of the paper, the general theory of Deep Reinforcement Learning (DRL) is discussed first and linked to financial planning. Then, we go into more depth on Dynamic Programming, which is one of the cornerstones of Reinforcement Learning. Next, a deep Backward Stochastic Differential Equation method is discussed, which solves the Terminal Partial Differential Equation of the dynamic programming system in higher dimensions. Lastly, an example that implements this method is presented.

## Reinforcement Learning 

Supervised and unsupervised learning are the two most widely studied branches of Machine Learning (ML). Besides these two, there is a third subcategory of ML called Reinforcement Learning (RL). The three branches differ fundamentally from each other. Supervised learning, for example, is designed to learn from a training set of labeled data, where each element of the training set describes a certain situation and is linked to a label/action that the supervisor has provided {cite}`hammoudeh2018concise`. RL, on the other hand, is a method in which the machine tries to map situations to actions by maximizing a reward signal {cite}`arulkumaran2017brief`. The two methods differ fundamentally in that RL has no supervisor who provides the label/action the machine needs to take; instead, a reward system is set up from which the machine can learn the correct action/label {cite}`hammoudeh2018concise`. In contrast to supervised learning, unsupervised learning tries to find hidden structures within an unlabeled dataset. This might seem similar to RL, as both methods work with unlabeled datasets, but RL tries to maximize a reward signal instead of only finding hidden structures in the data {cite}`arulkumaran2017brief`.

RL finds its roots in multiple research fields. Each of these fields contributes to RL in its own way (see figure) {cite}`hammoudeh2018concise`. For example, RL is similar to natural learning processes, in which learning happens by experiencing many failures and successes. Psychologists have therefore used RL to mimic the psychological processes at work when an organism makes choices based on experienced rewards and punishments {cite}`eckstein2021reinforcement`. While psychologists mimic psychological processes, neuroscientists use RL to focus on a well-defined network of brain regions that implement value learning {cite}`eckstein2021reinforcement`.

![Research fields involved in Reinforcement learning](tree.png)

### Finite Markov Decision Processes

RL can be represented in terms of finite Markov decision processes (MDPs), which are classical formalizations of sequential decision making. More specifically, MDPs give rise to a structure in which delayed rewards can be balanced against immediate rewards {cite}`sutton2018reinforcement`. They also enable a straightforward framing of learning from interaction to achieve a goal {cite}`levine2018reinforcement`. In its simplest form, RL works with an agent-environment interface. The agent is exposed to some representation of the environment's state $S_t \in \mathrm{S}$. From this representation the agent needs to choose an action $A_t \in \mathcal{A}(s)$, which results in a numerical reward $R_{t+1} \in \mathrm{R}$ and a new state $S_{t+1}$ (see figure 2) {cite}`sutton2018reinforcement`. The goal of the agent is to learn a mapping from states to actions, called a policy $\pi$, that maximizes the expected reward:

$$ \pi^* = \arg\max_{\pi} E[R|\pi] $$

If the MDP is finite and discrete, the sets of states, actions and rewards ($S$, $A$, and $R$) all have a finite number of elements. The agent-environment interaction can then be subdivided into episodes {cite}`arulkumaran2017brief`. The agent's goal is to maximize the expected discounted cumulative return in the episode {cite}`franccois2018introduction`:

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1}R_T = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1} $$

where $T$ is the time step at which the terminal state $S_T$ is reached and $\gamma$ is the discount rate. The terminal state $S_T$ is usually followed by a reset to a starting state or to a sample from a starting distribution of states {cite}`franccois2018introduction`. An episode ends once the reset has occurred. The discount rate determines the present value of future rewards. If $\gamma = 0$, the agent is myopic and is only concerned with maximizing the immediate reward. The agent can consequently be considered greedy {cite}`sutton2018reinforcement`.

The return can be rewritten recursively, in a dynamic programming fashion:

$$ G_t = R_{t+1} + \gamma(R_{t+2} + \gamma R_{t+3} + ... + \gamma^{T-t-2}R_T) $$
$$ G_t = R_{t+1} + \gamma G_{t+1}$$ 
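The recursion above can be sketched in a few lines of Julia. This is only an illustration: the function name `discounted_returns` and the reward values are made up, and the episode is assumed to be finite with $G_T = 0$.

```julia
# Discounted returns for one finite episode, computed backwards with the
# recursion G_t = R_{t+1} + γ * G_{t+1}, starting from G_T = 0.
function discounted_returns(rewards::AbstractVector{<:Real}, γ::Real)
    n = length(rewards)          # rewards[k] plays the role of R_k, k = 1..T
    G = zeros(Float64, n + 1)    # G[k] stores G_{k-1}; G[n + 1] = G_T = 0
    for k in n:-1:1
        G[k] = rewards[k] + γ * G[k + 1]
    end
    return G[1:n]                # G_0, ..., G_{T-1}
end

rewards = [1.0, 0.0, 2.0]        # hypothetical rewards R_1, R_2, R_3
G = discounted_returns(rewards, 0.5)
```

With $\gamma = 0$ the function simply returns the immediate rewards, which corresponds to the myopic agent described above.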




 

![Agent-environment interaction in a Markov decision process](standard_model.png)

A key concept of MDPs is the Markov property: only the current state affects the next state {cite}`franccois2018introduction`. The random variables $R_t$ and $S_t$ then have well-defined discrete probability distributions that depend only on the preceding state and action:

$$ p(s', r| s, a) = Pr(S_t = s', R_t = r | S_{t-1} = s, A_{t-1}=a) $$

for all $s', s \in \mathrm{S}$, $r \in \mathrm{R}$, $a \in \mathcal{A}(s)$. The probabilities of the elements of the sets $S$ and $R$ completely characterize the environment {cite}`sutton2018reinforcement`. Some algorithms relax this assumption, as it is often unrealistic. In a Partially Observable Markov Decision Process (POMDP), for example, the agent maintains a belief over the current state given the previous belief state, the action taken and the current observation {cite}`arulkumaran2017brief`. Once $p$ is known, the environment is fully described, and functions such as a transition function $T : S \times A \to p(S)$ and a reward function $R: S \times A \times S \to \mathbb{R}$ can be derived {cite}`sutton2018reinforcement`.

Most algorithms in RL use a value function to estimate the value of a given state for the agent. Value functions are defined with respect to the policy $\pi$ that the agent follows. As mentioned previously, $\pi$ is a mapping from states to probabilities of selecting each action. The value function $v_{\pi}(s)$ of a state $s$ under a policy $\pi$ is as follows:

$$ v_{\pi}(s) = E_{\pi}[G_t | S_t = s] = E_{\pi}\left[\sum_{k=0}^{T-t-1} \gamma^kR_{t+k+1} \Big| S_t=s\right] $$

This can also be rewritten recursively, in a dynamic programming fashion:

\begin{gather*} v_{\pi}(s) = E_{\pi}[G_t | S_t = s] \\ 
 = E_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\
 = \sum_a \pi(a|s) \sum_{s'} \sum_r p(s', r|s,a)\left[r + \gamma E_{\pi}[G_{t+1} | S_{t+1} = s']\right] \\
 = \sum_a \pi(a|s) \sum_{s', r}p(s', r|s,a)\left[r + \gamma v_{\pi}(s')\right] \end{gather*}

This formula is called the Bellman equation for $v_{\pi}$. It describes the relationship between the value of a state and the values of its successor states under a given policy $\pi$. The relation can also be represented by a backup diagram (see figure 3). If $v_{\pi}(s)$ is the value of a given state, then $q_{\pi}(s,a)$ is the value of taking a given action in that state:

$$ q_{\pi}(s,a) = E_{\pi}[G_t | S_t = s, A_t = a] = E_{\pi}\left[\sum_{k=0}^{T-t-1} \gamma^kR_{t+k+1} \Big| S_t=s, A_t = a\right] $$

In the backup diagram, this corresponds to starting from the black dot and computing the value of everything that follows. $q_{\pi}(s,a)$ is also called the action-value function, as it describes the value of each action in each state.

![Backup diagram for the value function](backup_diagram.png)
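To make the Bellman equation for $v_{\pi}$ more tangible, the following sketch uses it as an update rule (iterative policy evaluation) on a made-up two-state, two-action MDP. The arrays `P`, `R` and `policy` and the function `evaluate_policy` are hypothetical, and the reward is assumed to depend only on the state-action pair, which is a simplification of the general $p(s', r|s,a)$.

```julia
# Iterative policy evaluation on a small, made-up MDP.
# P[s, a, s2]  : probability of moving from state s to s2 under action a
# R[s, a]      : expected immediate reward (simplification of p(s', r | s, a))
# policy[s, a] : probability π(a | s) of choosing action a in state s
function evaluate_policy(P, R, policy, γ; tol = 1e-10, maxiter = 10_000)
    nS, nA = size(R)
    v = zeros(nS)
    for _ in 1:maxiter
        # right-hand side of the Bellman equation, used as an update rule
        v_new = [sum(policy[s, a] * (R[s, a] + γ * sum(P[s, a, s2] * v[s2] for s2 in 1:nS))
                     for a in 1:nA) for s in 1:nS]
        maximum(abs.(v_new .- v)) < tol && return v_new
        v = v_new
    end
    return v
end

# State 1: action 1 stays in state 1 (reward 0), action 2 moves to state 2 (reward 1).
# State 2 is absorbing and yields no reward.
P = zeros(2, 2, 2)
P[1, 1, 1] = 1.0
P[1, 2, 2] = 1.0
P[2, 1, 2] = 1.0
P[2, 2, 2] = 1.0
R = [0.0 1.0; 0.0 0.0]
policy = fill(0.5, 2, 2)         # uniformly random policy
v = evaluate_policy(P, R, policy, 0.9)
```

For this toy problem the fixed point can be checked by hand: $v_{\pi}(1) = 0.5 + 0.45\,v_{\pi}(1)$, so $v_{\pi}(1) = 0.5/0.55$, and $v_{\pi}(2) = 0$.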

For the agent it is important to find the optimal policy, which maximizes the expected cumulative reward. An optimal policy $\pi_*$ is a policy for which $v_{\pi_*}(s) \ge v_{\pi}(s)$ for all $s \in S$ and all policies $\pi$. There may be more than one optimal policy, but all optimal policies share the same value function $v_*(s)$ and the same action-value function $q_*(s,a)$ for all $s \in S$ and $a \in A$. The optimal value function is thus not tied to one particular policy:
$$ v_*(s) = \max_{a \in A(s)} q_{\pi_*}(s,a) $$
$$ = \max_{a} E_{\pi_*}[G_t | S_t=s, A_t=a] $$ 
$$ = \max_{a} E_{\pi_*}[R_{t+1} + \gamma G_{t+1} | S_t=s, A_t=a] $$
$$ = \max_{a} E[R_{t+1} + \gamma v_*(S_{t+1}) | S_t=s, A_t=a] $$

Once $v_*(s)$ is found, it suffices to act greedily with respect to it: because the optimal value function already takes into account the long-term consequences of each action, a one-step-ahead search yields the optimal action. Finding $q_*(s,a)$ makes things even easier, as the action-value function caches the results of all one-step-ahead searches.
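As an illustration of the Bellman optimality equation and of greedy action selection, the sketch below runs value iteration on a made-up three-state MDP in which a delayed reward beats an immediate one. The names `value_iteration`, `P` and `R` are hypothetical, the dynamics are assumed to be fully known, and the reward is again assumed to depend only on the state-action pair.

```julia
# Value iteration: repeatedly apply v(s) <- max_a [R(s,a) + γ Σ_s' P(s'|s,a) v(s')],
# then read off a greedy policy from the resulting action values.
function value_iteration(P, R, γ; tol = 1e-10, maxiter = 10_000)
    nS, nA = size(R)
    q(v, s, a) = R[s, a] + γ * sum(P[s, a, s2] * v[s2] for s2 in 1:nS)
    v = zeros(nS)
    for _ in 1:maxiter
        v_new = [maximum(q(v, s, a) for a in 1:nA) for s in 1:nS]
        converged = maximum(abs.(v_new .- v)) < tol
        v = v_new
        converged && break
    end
    greedy = [argmax([q(v, s, a) for a in 1:nA]) for s in 1:nS]
    return v, greedy
end

# State 1: action 1 pays 1 now and ends in state 3; action 2 pays 0 and moves to state 2.
# State 2: any action pays 3 and ends in state 3. State 3 is absorbing with reward 0.
P = zeros(3, 2, 3)
P[1, 1, 3] = 1.0
P[1, 2, 2] = 1.0
P[2, 1, 3] = 1.0
P[2, 2, 3] = 1.0
P[3, 1, 3] = 1.0
P[3, 2, 3] = 1.0
R = [1.0 0.0; 3.0 3.0; 0.0 0.0]
v_opt, greedy = value_iteration(P, R, 0.9)
```

In state 1 the greedy action is action 2, since $0 + 0.9 \cdot 3 = 2.7$ exceeds the immediate reward of 1. A myopic agent with $\gamma = 0$ would choose action 1 instead.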

Solving the Bellman equation of the value function or the action-value function exactly, such that all possibilities are known together with their probabilities and rewards, is not possible in most practical cases. This is typically due to three main factors. The first is that full knowledge of the dynamics of the environment is required. The second is the computational resources needed to complete the calculation. The last is that the states need to have the Markov property. To circumvent these obstacles, RL approximates the solution of the Bellman optimality equation using various methods. In the next section, a brief overview of these methods is given, with a focus on the methods applicable to financial planning.

### Model-based RL, model-free RL and planning

### Challenges in RL and deep RL

### Dynamic Programming

### Reinforcement learning and financial planning 

## Optimal consumption, investment and life insurance in an intertemporal model

The first person to include uncertain lifetime and life insurance decisions in a discrete life-cycle model was Yaari {cite}`yaari1965uncertain`. He explored the model using a utility function without bequest (Fisher utility function) and a utility function with bequest (Marshall utility function) in a bounded lifetime. In both cases, he looked at the implications of including life insurance. Although Yaari's model was revolutionary in the sense that the uncertainty of life could now be modeled, Leung {cite}`leung1994uncertain` found that the constraints imposed on the Fisher utility function were not adequate and led to terminal wealth depletion. Richard {cite}`richard1975optimal` applied the methodology of Merton {cite}`merton1969lifetime, merton1975optimum` to Yaari's problem setting in continuous time. Unfortunately, Richard's model had one deficiency: the bounded lifetime is incompatible with the dynamic programming approach used in Merton's model. As an individual approaches his maximal possible lifetime $T$, he will be inclined to buy an infinite amount of life insurance. To circumvent this, Richard used an artificial condition on the terminal value. But due to the recursive nature of dynamic programming, modifying the last value implies modifying the whole result. Ye {cite}`ye2006optimal` found a solution to the problem by abandoning the bounded random lifetime and replacing it with a random variable taking values in $[0,\infty)$. The models that replaced the bounded lifetime are thereafter called intertemporal models, as they do not consider the whole lifetime of an individual but rather the planning horizon of the consumer. Note that the general setting of Ye {cite}`ye2006optimal` covers a wide range of theoretical variables while still allowing a flexible approach to different financial settings. On this account, it is a good baseline for confronting the issues with current financial planning models.
However, one of the downsides of the model is its abstract representation of the consumer: the rational consumer is studied instead of the actual consumer. To detach the model from the notion of the rational consumer, I will look more closely at behavioral concepts that can be implemented. In the next paragraph, various modifications are discussed and the behavioral modifications are reviewed further.


After Ye {cite}`ye2006optimal`, various models have been proposed, each giving rise to its own solution to the consumption, investment, and insurance problem. The first setting is a model with multiple agents. For example, Bruhn and Steffensen {cite}`bruhn2011household` analyzed the optimization problem for couples with correlated lifetimes, with the partner nominated as beneficiary, using a copula and a common-shock model, while Wei et al. {cite}`wei2020optimal` studied optimization strategies for a household with economically and probabilistically dependent persons. Another setting uses constraints to better describe the financial situation of consumers. Kronborg and Steffensen {cite}`kronborg2015optimal` discussed two such constraints. One is a capital constraint under which savings cannot drop below zero. The other constraint involves a minimum return on savings. A third setting covers models that analyze the financial market and the insurance market in a pragmatic environment. A good illustration is the study of Shen and Wei {cite}`shen2016optimal`, who incorporate all stochastic processes involved in the investment and insurance markets, with all randomness described by a Brownian motion filtration. An interesting body of models deals with time-inconsistent preferences. In this framework, consumers do not have a time-consistent rate of preference as assumed in the economic literature. Rather, there is a divergence between earlier intentions and later choices (De-Paz et al. {cite}`de2014consumption`). This concept is predominantly described in psychology. Specifically, rewards presented closer to the present are discounted proportionally less than rewards further into the future. Applications of time-inconsistent preferences to the consumption, investment, and insurance optimization can be found in Chen and Li {cite}`chen2020time` and De-Paz et al. {cite}`de2014consumption`.
These time-inconsistent preferences are rooted in a much deeper behavioral concept called future self-continuity. Future self-continuity can be described as how someone sees himself in the future. In classical economic theory, the degree to which you identify with your future self is assumed to have no impact on the ultimate result. In the next subsection, the relationship between future self-continuity and time-inconsistent preferences is looked at more closely, and future self-continuity is further examined in the behavioral life-cycle model.

## The model specifications

In this section, I will set the dynamics for the baseline model in place. The dynamics follow primarily from the paper of Ye {cite}`ye2006optimal`.

Let the state of the economy be represented by a standard Brownian motion $W(t)$, let the state of the consumer's wealth be characterized by a finite state multi-dimensional continuous-time Markov chain $X(t)$, and let the time of death be defined by a non-negative random variable $\tau$. All are defined on a given probability space $(\Omega, \mathcal{F}, P)$, and $W(t)$ is independent of $\tau$. Let $T< \infty$ be a fixed planning horizon, which can be seen as the end of the consumer's working life. Let $\mathbb{F} = \{\mathcal{F}_t, t \in [0,T]\}$ be the $P$-augmentation of the filtration $\sigma\{W(s), s \le t\}$, $\forall t \in [0,T]$, so that $\mathcal{F}_t$ represents the information available at time $t$. The economy consists of a financial market and an insurance market. In the following, I construct these markets separately.



The financial market consists of a risk-free security $B(t)$ and a risky security $S(t)$, which evolve according to

$$ \frac{dB(t)}{B(t)}=r(t)dt $$

$$ \frac{dS(t)}{S(t)}=\mu(t)dt+\sigma(t)dW(t)$$

where $\mu(t), r(t), \sigma(t): [0,T] \to \mathbb{R}$ are continuous functions (positive constants $\mu, \sigma, r > 0$ being a special case), and $\sigma(t)$ satisfies $\sigma^2(t) \ge k$, $\forall t \in [0,T]$, for some constant $k > 0$.
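For intuition, the two price processes can be simulated in the special case of constant coefficients. The sketch below is an assumption-laden illustration: the function names `bond` and `simulate_gbm` are made up, the values $\mu = 0.09$, $\sigma = 0.18$, $r = 0.04$ are borrowed from the parameter list used later in this paper, and the risky asset is advanced with the exact log-normal step that is only valid for constant $\mu$ and $\sigma$.

```julia
using Random

# Risk-free security with constant rate r: B(t) = B0 * exp(r * t).
bond(B0, r, t) = B0 * exp(r * t)

# One path of the risky security with constant μ and σ on n equal time steps,
# using S(t + Δt) = S(t) * exp((μ - σ²/2) Δt + σ sqrt(Δt) Z) with Z ~ N(0, 1).
function simulate_gbm(S0, μ, σ, T, n; rng = Random.default_rng())
    Δt = T / n
    S = Vector{Float64}(undef, n + 1)
    S[1] = S0
    for k in 1:n
        Z = randn(rng)
        S[k + 1] = S[k] * exp((μ - σ^2 / 2) * Δt + σ * sqrt(Δt) * Z)
    end
    return S
end

# 40 years with monthly steps; the seed is arbitrary.
path = simulate_gbm(1.0, 0.09, 0.18, 40.0, 480; rng = MersenneTwister(1))
```

Setting $\sigma = 0$ removes all randomness, so the risky asset then grows deterministically at rate $\mu$, just as the risk-free security grows at rate $r$.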


The random variable $\tau$ first needs to be modeled for the insurance market. Assume that $\tau$ has a probability density function $f(t)$ and a probability distribution function given by

$$ F(t) \triangleq P(\tau < t) = \int_0^t f(u) du $$

Here $\tau$ is assumed to be independent of the filtration $\mathbb{F}$.

Based on the probability distribution function, the survival function can be defined as follows

$$ \bar{F}(t)\triangleq P(\tau \ge t) = 1 -F(t) $$

The hazard function is the  instantaneous death rate for the consumer at time t and is defined by 

$$ \lambda(t) = \lim_{\Delta t\to 0} \frac{P(t\le\tau < t+\Delta t| \tau \ge t)}{\Delta t} $$

where $\lambda(t): [0,\infty[ \to R^+$ is a continuous, deterministic function with $\int_0^\infty \lambda(t) dt = \infty$.

Subsequently, the survival and probability density function can be characterized by 


$$ \bar{F}(t)= {}_tp_0= e^{-\int_0^t \lambda(u)du} $$
$$ f(t)=\lambda(t) e^{-\int_0^t\lambda(u)du} $$ 

The conditional density and the conditional survival probability, given survival up to time $t$, are

$$ f(s,t) \triangleq \frac{f(s)}{\bar{F}(t)}=\lambda(s) e^{-\int_t^s\lambda(u)du} $$
$$ \bar{F}(s,t) = {}_sp_t \triangleq \frac{\bar{F}(s)}{\bar{F}(t)} = e^{-\int_t^s \lambda(u)du} $$
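The formulas above can be evaluated in closed form for a linear hazard rate. The sketch below assumes $\lambda(t) = 1/200 + 9t/8000$, the hazard rate from the parameter list used later in this paper; the helper names `Λ`, `survival`, `density` and `trapezoid` are made up for this illustration.

```julia
# Linear hazard rate (assumed, taken from the parameter list of the example).
λ(t) = 1/200 + 9/8000 * t

# Integrated hazard ∫_t^s λ(u) du, in closed form for the linear λ above.
Λ(t, s) = (s - t) / 200 + 9 / 16000 * (s^2 - t^2)

# Conditional survival probability F̄(s, t) and conditional density f(s, t);
# with the default t = 0 they reduce to F̄(s) and f(s).
survival(s, t = 0.0) = exp(-Λ(t, s))
density(s, t = 0.0) = λ(s) * survival(s, t)

# Simple trapezoid rule, used only to cross-check the closed forms.
function trapezoid(g, a, b; n = 10_000)
    h = (b - a) / n
    return h * (g(a) / 2 + sum(g(a + k * h) for k in 1:n-1) + g(b) / 2)
end

# Probability of dying before T = 40, computed in two ways.
F_numeric = trapezoid(density, 0.0, 40.0)
F_closed = 1 - survival(40.0)
```

Both numbers should agree up to the discretization error of the trapezoid rule, which is a quick consistency check on $f(t) = \lambda(t)e^{-\int_0^t \lambda(u)du}$.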

    
Now that $\tau$ has been modeled, the life insurance market can be constructed. Assume that life insurance is offered continuously and that it provides coverage for an infinitesimally small period of time. The consumer pays a premium rate $p$ when he enters into a life insurance contract, so that he can insure his future income. In return, a total benefit of $\frac{p}{\eta(t)}$ is paid out if he dies at time $t$, where $\eta : [0,T] \to \mathbb{R}^+$ is a continuous, deterministic function.

Both markets are now described, and the wealth process $X(t)$ of the consumer can be constructed. Given an initial wealth $x_0$, the consumer receives income at rate $i(t)$, $\forall t \in [0,\tau \wedge T]$, satisfying $\int_0^{\tau \wedge T} i(u)du < \infty$. At time $t$ he needs to choose a premium rate $p(t)$, a consumption rate $c(t)$ and an amount of his wealth $\theta (t)$ that he invests in the risky asset $S(t)$. Given the processes $\theta$, $c$, $p$ and $i$, the wealth process $X(t)$, $\forall t \in [0, \tau \wedge T]$, is determined by

$$ dX(t) = r(t)X(t)dt + \theta(t)[( \mu(t) - r(t))dt +\sigma(t)dW(t)] -c(t)dt -p(t)dt + i(t)dt,   \quad t \in [0,\tau \wedge T] $$

If $t=\tau$, the insured amount $\frac{p(t)}{\eta(t)}$ is paid out. Given his wealth $X(t)$ at time $t$, his total legacy will be

$$ Z(t) = X(t) + \frac{p(t)}{\eta(t)} $$ 
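The wealth dynamics can be simulated with an Euler-Maruyama scheme. The sketch below is a simplified illustration, not the method used in this paper: the controls $c$, $p$ and $\theta$ are held constant instead of being optimized, the coefficients $r$, $\mu$, $\sigma$ are constant, death before $T$ is ignored, and the function names `wealth_step`, `simulate_wealth` and `legacy` are made up.

```julia
using Random

# One Euler-Maruyama step of
# dX = [r X + θ (μ - r) + i(t) - c - p] dt + θ σ dW.
function wealth_step(x, t, Δt, ΔW; r, μ, σ, income, c, p, θ)
    drift = r * x + θ * (μ - r) + income(t) - c - p
    return x + drift * Δt + θ * σ * ΔW
end

# Terminal wealth X(T) along one simulated path with n equal time steps.
function simulate_wealth(x0, T, n; rng = Random.default_rng(), kwargs...)
    Δt = T / n
    x = x0
    for k in 0:n-1
        ΔW = sqrt(Δt) * randn(rng)      # Brownian increment over one step
        x = wealth_step(x, k * Δt, Δt, ΔW; kwargs...)
    end
    return x
end

# Legacy Z = X + p / η if the consumer dies while holding wealth x.
legacy(x, p, η) = x + p / η

# With θ = 0 nothing is invested in the risky asset, so the path is deterministic.
xT = simulate_wealth(100.0, 1.0, 1000; r = 0.04, μ = 0.09, σ = 0.18,
                     income = t -> 10.0, c = 5.0, p = 1.0, θ = 0.0)
```

In the deterministic case $\theta = 0$ with constant net inflow $i - c - p$, the result can be compared with the exact solution $x_0 e^{rT} + (i - c - p)(e^{rT} - 1)/r$ of the resulting ordinary differential equation.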

The problem for the consumer is that he needs to choose the optimal rates $c$, $p$ and $\theta$ from the set $\mathcal{A}$, called the set of admissible strategies and defined by

$$ \mathcal{A}(x) \triangleq  \textrm{set of all possible triplets (c,p,}\theta) $$ 

such that his expected utility from consumption, from legacy when $\tau \le T$ and from terminal wealth when $\tau > T$ is maximized:

$$ V(x) \triangleq \sup_{(c,p,\theta) \in \mathcal{A}(x)} E\left[\int_0^{T \wedge \tau} U(c(s),s)ds + B(Z(\tau),\tau)1_{\{\tau \le T\}} + L(X(T))1_{\{\tau>T\}}\right] $$ 

where $U(c,t)$ is the utility function of consumption, $B(Z,t)$ is the utility function of legacy and $L(X)$ is the utility function of terminal wealth. $V(x)$ is called the value function, and the consumer wants to attain it by choosing the optimal strategy $(c,p,\theta) \in \mathcal{A}(x)$. The optimal strategy is found using the dynamic programming technique described in the following section.

## Dynamic programming principle

To solve the consumer's problem the value function needs to be restated in a dynamic programming form. 

$$J(t, x; c, p, \theta) \triangleq E \left[\int_t^{T \wedge \tau} U(c(s),s)ds + B(Z(\tau),\tau)1_{\{\tau \le T\}} + L(X(T))1_{\{\tau>T\}}\Big| \tau> t, \mathcal{F}_t \right] $$

The value function becomes

$$ V(t,x) \triangleq \sup_{\{c,p,\theta\} \in \mathcal{A}(t,x)} J(t, x; c, p, \theta)  $$

Because $\tau$ is independent of the filtration, the objective $J$ inside the value function can be rewritten as 

$$ J(t, x; c, p, \theta) = E \left[\int_t^T  \left(\bar{F}(s,t)U(c(s),s) + f(s,t)B(Z(s),s)\right) ds  + \bar{F}(T,t)L(X(T))\Big| \mathcal{F}_t \right]$$ 

The optimization problem is now converted from a problem with a random terminal time to a problem with a fixed terminal time. Mortality can also be seen as a discounting function for the consumer, as he weighs future utility by the probability of surviving to enjoy it.
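The conversion to a fixed terminal time rests on a short argument, sketched here under the assumptions that the controls are adapted to $\mathbb{F}$ and that expectation and integration may be interchanged (the rigorous proof is in Ye {cite}`ye2006optimal`). Because $\tau$ is independent of $\mathbb{F}$, for $s \ge t$

$$ P(\tau > s \mid \tau > t) = \frac{\bar{F}(s)}{\bar{F}(t)} = \bar{F}(s,t) $$

so that each of the three utility terms can be averaged over $\tau$ separately:

$$ \begin{aligned} E\left[\int_t^{T\wedge\tau} U(c(s),s)ds \,\Big|\, \tau>t, \mathcal{F}_t\right] &= E\left[\int_t^T 1_{\{\tau > s\}}U(c(s),s)ds \,\Big|\, \tau > t, \mathcal{F}_t\right] = E\left[\int_t^T \bar{F}(s,t)U(c(s),s)ds \,\Big|\, \mathcal{F}_t\right] \\ E\left[B(Z(\tau),\tau)1_{\{\tau \le T\}} \,\Big|\, \tau>t, \mathcal{F}_t\right] &= E\left[\int_t^T f(s,t)B(Z(s),s)ds \,\Big|\, \mathcal{F}_t\right] \\ E\left[L(X(T))1_{\{\tau > T\}} \,\Big|\, \tau>t, \mathcal{F}_t\right] &= \bar{F}(T,t)\,E\left[L(X(T)) \,\Big|\, \mathcal{F}_t\right] \end{aligned} $$

Summing the three right-hand sides gives an objective over the fixed interval $[t,T]$ in which $\bar{F}(s,t)$ and $f(s,t)$ act as mortality weights.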

Following the dynamic programming principle, this equation can be rewritten as the value function at a later time $s$ plus the value created between time $t$ and time $s$. This casts the optimization problem in a time-step setting, giving the incremental value gained at each point in time.

$$ V(t,x) = \sup_{\{c,p,\theta\} \in \mathcal{A}(t,x)} E\left[e^{-\int_t^s\lambda(v)dv}V(s,X(s)) + \int_t^s \left(f(u,t)B(Z(u),u) + \bar{F}(u,t)U(c(u),u)\right)du\,\Big|\,\mathcal{F}_t\right] $$ 

The Hamilton-Jacobi-Bellman (HJB) equation can be derived from the dynamic programming principle and is as follows

\begin{gather*}
\begin{cases} 
V_t(t,x) -\lambda(t) V(t,x) + \sup_{(c,p,\theta)} \Psi(t,x;c,p,\theta)  = 0 \\ V(T,x) = L(x)  
\end{cases}
\end{gather*}

where 

$$ \begin{aligned} \Psi(t,x; c,p,\theta) \triangleq{} & \left(r(t)x + \theta(\mu(t) -r(t)) + i(t) -c -p\right)V_x(t,x) \\ & + \frac{1}{2}\sigma^2(t)\theta^2V_{xx}(t,x) + \lambda(t)B(x+ p/\eta(t),t) + U(c,t) \end{aligned} $$


Proofs of the HJB equation, of the dynamic programming principle and of the conversion from a random terminal time to a fixed terminal time can be found in Ye {cite}`ye2006optimal`.

A strategy $(c^*, p^*, \theta^*)$ is optimal if it attains the supremum in the HJB equation:

\begin{gather*}
0 =V_t(t,x) -\lambda(t)V(t,x) + \sup_{c,p,\theta}\Psi(t,x;c,p,\theta)  \\
0 = V_t(t,x) -\lambda(t)V(t,x) + (r(t)x+ i(t))V_x(t,x) + \sup_c\{U(c,t)-cV_x(t,x)\} \\ + \sup_p\{\lambda(t)B(x + p/\eta(t),t) - pV_x(t,x)\} + \sup_\theta \{ \frac{1}{2}\sigma^2(t)V_{xx}(t,x)\theta^2 +(\mu(t) - r(t))V_x(t,x)\theta\} 
\end{gather*}


The first-order conditions for a regular interior maximum are obtained by setting the partial derivative of $\Psi$ with respect to each control to zero:

$$\sup_c  \{ U(c,t) - cV_x\}: \quad \Psi_c(t,x;c^*,p^*,\theta^*) = 0 \;\Rightarrow\;  0 = -V_x(t,x) + U_c(c^*,t) $$

$$ \sup_p\{\lambda(t)B(x + p/\eta(t),t) - pV_x\}: \quad \Psi_p(t,x;c^*,p^*,\theta^*) = 0 \;\Rightarrow\; 0 = -V_x(t,x) + \frac{\lambda(t)}{\eta(t)}B_Z(x + p^*/\eta(t),t) $$

$$ \sup_\theta \{ \tfrac{1}{2}\sigma^2(t)V_{xx}(t,x)\theta^2 +(\mu(t) - r(t))V_x(t,x)\theta\}: \quad \Psi_\theta(t,x;c^*,p^*,\theta^*) = 0 \;\Rightarrow\; 0 = (\mu(t) -r(t))V_x(t,x) + \sigma^2(t)\theta^*V_{xx}(t,x) $$

The second order conditions are 

$$ \Psi_{cc}, \Psi_{pp}, \Psi_{\theta \theta} < 0 $$ 
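Two of these conditions can be solved explicitly, which may help when checking a numerical solution. The condition for $\theta$ is linear in $\theta^*$, so

$$ \theta^* = -\frac{\mu(t) - r(t)}{\sigma^2(t)} \frac{V_x(t,x)}{V_{xx}(t,x)} $$

and the second-order condition $\Psi_{\theta\theta} = \sigma^2(t)V_{xx}(t,x) < 0$ holds exactly when $V$ is strictly concave in $x$. For the consumption condition a utility function has to be specified. As an illustration only, assume the discounted power utility $U(c,t) = e^{-\rho t}c^{\gamma}/\gamma$ with $\gamma < 1$, $\gamma \ne 0$ (this functional form is an assumption made here for the example). Then $U_c(c,t) = e^{-\rho t}c^{\gamma - 1}$, and the first-order condition $U_c(c^*,t) = V_x(t,x)$ gives

$$ c^* = \left(e^{\rho t}V_x(t,x)\right)^{\frac{1}{\gamma - 1}} $$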

### The analytic result: Constant Relative Risk Aversion utility function

This optimal control problem has been solved analytically by Ye {cite}`ye2006optimal` for the Constant Relative Risk Aversion (CRRA) utility function. In this paper, I use this analytical result as a benchmark for the NeuralNetDiffEq algorithm and check whether the algorithm converges to it. Once this is established, other utility functions can be used to solve the optimal control problem with the NeuralNetDiffEq algorithm.
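As a small sanity check of what an analytic benchmark looks like, the sketch below evaluates the first-order condition for $\theta$ under the assumption that the value function has the power form $V(t,x) = a(t)(x + b(t))^{\gamma}/\gamma$. This form is an assumption made for the illustration, as are the function names; `b` is treated as a given input and is not computed here. With this form, $V_x/V_{xx} = (x+b)/(\gamma - 1)$, so $\theta^* = \frac{\mu - r}{(1-\gamma)\sigma^2}(x + b)$.

```julia
# Fraction of (x + b) invested in the risky asset under the assumed power-form
# value function: (μ - r) / ((1 - γ) σ²).
merton_fraction(μ, r, σ, γ) = (μ - r) / ((1 - γ) * σ^2)

# Amount θ* invested in the risky asset for wealth x and a given shift b.
risky_amount(x, b, μ, r, σ, γ) = merton_fraction(μ, r, σ, γ) * (x + b)

# Constant parameters taken from the example: μ = 0.09, r = 0.04, σ = 0.18, γ = -3.
frac = merton_fraction(0.09, 0.04, 0.18, -3)
θ_star = risky_amount(100_000.0, 0.0, 0.09, 0.04, 0.18, -3)
```

A more negative $\gamma$ means higher risk aversion and therefore a smaller fraction invested in the risky asset, which is a quick qualitative check for any numerical solution.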

## Algorithm NeuralNetDiffEq

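```julia
using Plots

# Analytic result of the baseline model for CRRA utility functions,
# used to test whether the algorithm is working.

# Parameters
T = 40                          # planning horizon
y(t) = 50000 * exp(0.03 * t)    # income rate (i(t) in the text)
r = 0.04                        # risk-free rate
μ = 0.09                        # drift of the risky asset
σ = 0.18                        # volatility of the risky asset
ρ = 0.03                        # presumably the utility discount rate
λ(t) = 1/200 + 9/8000 * t       # hazard rate
η(t) = 1/200 + 9/8000 * t       # sets the benefit p/η(t); here equal to λ(t)
γ = -3                          # CRRA utility parameter
```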

## References

```{bibliography}
```