## See goals below:

- Write out the MP/MRP definitions and MRP Value Function definition (in LaTeX) in your own style/notation (so you really internalize these concepts)
- Think about the data structures/class design (in Python 3) to represent MP/MRP and implement them with clear type declarations
- Remember your data structure/code design must resemble the Mathematical/notational formalism as much as possible
- Specifically, the data structure/code design of MRP should build incrementally on (rather than be independent of) that of MP
- Separately implement the $r(s,s')$ and the $\mathcal{R}(s) = \sum_{s'} p(s,s') \cdot r(s,s')$ definitions of MRP
- Write code to convert/cast the $r(s,s')$ definition of MRP to the $\mathcal{R}(s)$ definition of MRP (put some thought into code design here)
- Write code to generate the stationary distribution for an MP

First, we have some definitions. The value function $v(s)$ is defined as 
$$v(s)=\mathbb{E}\Big[\sum_{i=0}^{\infty}\gamma^iR_{t+i+1}\Big|S_t=s\Big],$$
where $R_i$ is the reward at time $i$ and $\gamma$ is the discount factor. Furthermore, the reward function $\mathcal{R}(s)$ is defined as $$\mathcal{R}(s)=\mathbb{E}[R_{t}|S_{t-1}=s].$$
Following the notation of Sutton and Barto's book, we also have the function $r(s,s')$ such that 
$$r(s,s')=\mathbb{E}[R_t|S_{t-1}=s\;\cap\;S_t=s']$$ and the function $$p(s,s')=\mathbb{P}(S_t=s'|S_{t-1}=s).$$
Thus, by the law of total expectation, we see that 
\begin{equation*}
    \begin{split}
        \mathcal{R}(s)&=\mathbb{E}[R_{t}|S_{t-1}=s]\\
        &=\sum_{s'}\mathbb{E}[R_t|S_{t-1}=s\;\cap\;S_t=s']\mathbb{P}(S_t=s'|S_{t-1}=s)\\
        &=\sum_{s'}r(s,s')p(s,s').
    \end{split}
\end{equation*}
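
The same conditioning argument also links $v$ to $\mathcal{R}$. The following is a short worked sketch, assuming a finite state space and $\gamma<1$; the vector symbols $\mathbf{v}$, $\mathbf{R}$ and the matrix $\mathbf{P}$ (with entries $v(s)$, $\mathcal{R}(s)$ and $p(s,s')$) are notation introduced here for illustration. Splitting off the first reward in the definition of $v(s)$ and using the Markov property gives
$$v(s)=\mathbb{E}[R_{t+1}|S_t=s]+\gamma\,\mathbb{E}\Big[\sum_{i=0}^{\infty}\gamma^iR_{t+i+2}\Big|S_t=s\Big]=\mathcal{R}(s)+\gamma\sum_{s'}p(s,s')v(s').$$
In vector form this reads $\mathbf{v}=\mathbf{R}+\gamma\mathbf{P}\mathbf{v}$, and since $I-\gamma\mathbf{P}$ is invertible for $\gamma<1$,
$$\mathbf{v}=(I-\gamma\mathbf{P})^{-1}\mathbf{R}.$$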

First we import the necessary packages.

In [1]:
import numpy as np

**Now we define a Markov Process (MP) as a class.**

In [2]:
class MP(object):
    def __init__(self, States:list, ProbDistribution, print_info:bool=False, Name:str="Nameless"):
        self.Name=Name
        self.States = States
        self.ProbDistribution = ProbDistribution
        self.print_info = print_info
        if print_info:
            print("The Markov process", Name, "has been created. It has", len(States), "states.")
    
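    def Generate_Stationary_Dist(self):
        # A stationary distribution is a left eigenvector of the transition matrix
        # with eigenvalue 1, i.e. an eigenvector of its transpose.
        # This assumes the chain has a unique stationary distribution.
        eigenvalues, eigenvectors = np.linalg.eig(self.ProbDistribution.T)
        stat_dist=np.zeros((len(self.States),1))
        for i in range(len(eigenvalues)):
            if abs(eigenvalues[i]-1) < 1e-8:
                # np.linalg.eig may return complex arrays, so keep only the real part.
                stat_dist = stat_dist + np.real(eigenvectors[:,i]).reshape(len(self.States),1)
        # Normalize so that the entries sum to 1.
        return((stat_dist/np.sum(stat_dist)).reshape(1,len(self.States)))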
    
    def Simulate(self,steps:int,start=None,print_text=False):
        # Simulate at most `steps` transitions, starting from `start` (default: the first state).
        if start is None:
            start = self.States[0]
        path = [start]
        j = self.States.index(start)
        for i in range(steps):
            # An absorbing state (self-transition probability 1) terminates the process.
            if self.ProbDistribution[j][j] == 1:
                if print_text:
                    print("The process reached the termination state", "'",self.States[j],"'", "after", i, "steps.")
                break
            # Sample the index of the next state from row j of the transition matrix.
            j = np.random.choice(len(self.States),p=self.ProbDistribution[j])
            path.append(self.States[j])
        if print_text:
            print("The path was:", path)
        return(path)
    

The class MP is defined by its states and their transition probabilities. Now we can test the two methods implemented in the Markov process class.

In [3]:
Test_MP = MP(["Sleep","Wake up","Eat"],np.array([1/2,1/4,1/4,0,2/3,1/3,1/3,1/3,1/3]).reshape(3,3))

In [4]:
Test_MP.Generate_Stationary_Dist()

array([[0.21052632, 0.47368421, 0.31578947]])
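
As a cross-check on the eigenvector approach, the stationary distribution can also be obtained by solving the linear system $\pi P=\pi$, $\sum_s\pi(s)=1$ directly. The following is a minimal sketch, assuming the chain has a unique stationary distribution; the variable names `A`, `b` and `pi` are illustrative and not part of the `MP` class.

```python
import numpy as np

# Transition matrix of Test_MP (rows: Sleep, Wake up, Eat).
P = np.array([[1/2, 1/4, 1/4],
              [0,   2/3, 1/3],
              [1/3, 1/3, 1/3]])
n = P.shape[0]

# Stack the stationarity equations (P^T - I) pi = 0 with the normalization sum(pi) = 1.
A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
b = np.concatenate([np.zeros(n), [1.0]])

# The system is overdetermined but consistent, so least squares recovers its solution.
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(pi)
print(np.allclose(pi @ P, pi))  # checks that pi is invariant under P
```

For this matrix the exact solution is $(4/19,\,9/19,\,6/19)$, which agrees with the array returned above.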

In [5]:
Test_MP.Simulate(5,"Sleep",True)

The path was: ['Sleep', 'Sleep', 'Sleep', 'Wake up', 'Eat', 'Eat']


['Sleep', 'Sleep', 'Sleep', 'Wake up', 'Eat', 'Eat']
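
Simulation and the stationary distribution can be tied together: over a long run, the fraction of time spent in each state should approach the stationary distribution (assuming the chain is irreducible, as this example is). The sketch below is self-contained and does not use the `MP` class; the seed, the number of steps and the names `rng`, `counts` and `freq` are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible
states = ["Sleep", "Wake up", "Eat"]
P = np.array([[1/2, 1/4, 1/4],
              [0,   2/3, 1/3],
              [1/3, 1/3, 1/3]])

n_steps = 20_000
counts = np.zeros(len(states))
j = 0  # start in "Sleep"
for _ in range(n_steps):
    # Sample the index of the next state from row j of the transition matrix.
    j = rng.choice(len(states), p=P[j])
    counts[j] += 1

# Empirical fraction of steps spent in each state.
freq = counts / n_steps
print(dict(zip(states, np.round(freq, 3))))
```

The printed frequencies should lie close to the stationary distribution computed earlier, with the gap shrinking as `n_steps` grows.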

**Now we can define a Markov Reward Process (MRP).**

In [6]:
class MRP(MP):
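    def __init__(self,States:list,ProbDistribution,Rewards,gamma:float):
        # An MRP is an MP extended with rewards and a discount factor.
        # Rewards[k] is the reward received when the process moves to state k.
        super().__init__(States, ProbDistribution)
        self.Rewards=Rewards
        self.gamma=gamma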
        
    def Get_Expected_Reward_one_state(self,start=None):
        # R(s) = sum_{s'} p(s,s') * Rewards[s'], for all states or only for `start`.
        RS=np.dot(self.ProbDistribution,self.Rewards)
        if start is None:
            return(RS)
        return(RS[self.States.index(start)])
                
    def Get_Value_Function(self):
        # Solve the Bellman equation v = R + gamma*P*v, i.e. (I - gamma*P) v = R.
        R=np.dot(self.ProbDistribution,self.Rewards)
        A=np.identity(len(self.States))-self.gamma*self.ProbDistribution
        return(np.linalg.solve(A,R))
    
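    def rss_to_RS(self):
        # Rewards holds one reward per next state s', so r(s,s') = Rewards[s'] and
        # R(s) = sum_{s'} p(s,s') * r(s,s') reduces to the product of ProbDistribution and Rewards.
        RS=self.Get_Expected_Reward_one_state()
        return(RS)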
            

In [8]:
Test_MRP = MRP(Test_MP.States,Test_MP.ProbDistribution,np.array([1,2,3]).reshape(3,1),0.99)

In [9]:
Test_MRP.Get_Value_Function()

array([[209.91591902],
       [210.96027683],
       [210.28230542]])
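
For a reward that depends on both $s$ and $s'$, the conversion $\mathcal{R}(s)=\sum_{s'}p(s,s')\cdot r(s,s')$ becomes a row-wise sum of an element-wise product. The sketch below is one possible way to write it, not part of the `MRP` class above: the function names `rss_to_rs` and `value_function` are hypothetical, and the matrix `r` is constructed so that the reward depends only on the next state, which matches the rewards 1, 2, 3 used in `Test_MRP`.

```python
import numpy as np

def rss_to_rs(P: np.ndarray, r: np.ndarray) -> np.ndarray:
    """R(s) = sum_{s'} p(s,s') * r(s,s') for an n-by-n reward matrix r (hypothetical helper)."""
    return np.sum(P * r, axis=1)

def value_function(P: np.ndarray, R: np.ndarray, gamma: float) -> np.ndarray:
    """Solve the Bellman equation (I - gamma*P) v = R for v (hypothetical helper)."""
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P, R)

P = np.array([[1/2, 1/4, 1/4],
              [0,   2/3, 1/3],
              [1/3, 1/3, 1/3]])
# Every row of r is [1, 2, 3]: the reward depends only on the next state s'.
r = np.tile(np.array([1.0, 2.0, 3.0]), (3, 1))
gamma = 0.99

R = rss_to_rs(P, r)
v = value_function(P, R, gamma)
print(R)
print(v)
# The solution should satisfy the Bellman equation v = R + gamma * P v.
print(np.allclose(v, R + gamma * P @ v))
```

With these inputs `v` should match the values returned by `Test_MRP.Get_Value_Function()` above, up to floating-point error. Using `np.linalg.solve` instead of forming the inverse explicitly is a design choice made here for numerical stability.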