# Lecture 12

The Rust (1987) data is a bit unusual, in that it is an exceptionally popular model, but the data doesn't exist in any sort of particularly clean format! So we will extract the original data ourselves and see if we can form the requiste variables ourselves. This is also a good chance to flex your `R` programming skills.

We are going to be working with the actual Rust (1987) dataset, retrieved from https://editorialexpress.com/jrust/nfxp.html. In our working directory we have the 9 `.asc` files in the `dat` subdirectory as well as the `nfxp_man.pdf` readme document. Let's see if we can reconstruct the Rust dataset using this.

The following excerpt from John Rust's original `nfxp_man.pdf` file for
the bus data describes the nature and format of the data files:

    This directory contains data on odometer readings and dates of
    bus engine replacements of 162 buses in the fleet of the Madison
    Metropolitan Bus Company that were in operation sometime during
    the period December, 1974 to May, 1985. The documentation of
    the contents of the files is described in more detail in chapter
    4 of the documentation manual.
    
    The directory contains the following files, each
    corresponding to a different model/vintage of bus
    in the Madison Metro fleet:
    
    D309.ASC     110x4 matrix for Davidson model 309 buses
    G870.ASC     36x15 matrix for Grumman model 870 buses
    RT50.ASC     60x4  matrix for Chance model RT50 buses
    T8H203.ASC   81x48 matrix for GMC model T8H203 buses
    A452372.ASC 137x18 matrix for GMC model A4523 buses, model year 1972
    A452374.ASC 137x10 matrix for GMC model A4523 buses, model year 1974
    A530872.ASC 137x18 matrix for GMC model A5308 buses, model year 1972
    A530874.ASC 137x12 matrix for GMC model A5308 buses, model year 1974
    A530875.ASC 128x37 matrix for GMC model A5308 buses, model year 1975
    
    The data in each file are vectorized into a single column: e.g.
    D309.ASC is a 440x1 vector consisting of the columns
    of a 110x4 matrix stacked on top of each other consecutively.

Because the `.asc` files are stored as vectors, we need to supply information about the sizes of each matrix.

In [5]:
thepath = getwd()

fileArr = c("d309.asc", "g870.asc", "rt50.asc", "t8h203.asc", "a530875.asc", "a530874.asc", "a452374.asc", 
    "a530872.asc", "a452372.asc")
nbRowsArr = c(110, 36, 60, 81, 128, 137, 137, 137, 137)
nbColsArr = c(4, 15, 4, 48, 37, 12, 10, 18, 18)

Note: it looks like Rust does not include the d309 model, which is why he has $162$ buses in his dataset, not $166$. Hence we will not include `d309.asc`. 

In [6]:
toInclude = 2:9
fileArr = fileArr[toInclude]
nbRowsArr = nbRowsArr[toInclude]
nbColsArr = nbColsArr[toInclude]

nbBuses = sum(nbColsArr)
nbMonths = max(nbRowsArr)-11 # number of months in the period
print(paste0('The number of buses is ', nbBuses))
print(paste0('The number of months is ', nbMonths))

[1] "The number of buses is 162"
[1] "The number of months is 126"


    Each of the 8 raw data files is a GAUSS matrix file. That is, each file is read into memory as a $T \times M$ matrix, where M is the number of buses in the file and T is the number of data records per bus. Thus, each column of the matrix file contains data for a single bus. The first eleven rows of the matrix are the file “header” that contains information on the bus number, its date of purchase, the dates and odometer readings of engine replacements, and the month and year of the first odometer observation. The remaining T - 11 rows of the matrix contain the consecutive monthly odometer readings for each bus (with the exception of a two month gap to account for the strike during July and August, 1980). Specifically, the header contains the following information:

| Row  | Item                            | Sample Entries |
| ---- | ------------------------------- | -------------- |
| 1    | bus number                      | 5297           |
| 2    | month purchased                 | 8              |
| 3    | year purchased                  | 75             |
| 4    | month of 1st engine replacement | 4              |
| 5    | year of 1st engine replacement  | 79             |
| 6    | odometer at replacement         | 153400         |
| 7    | month of 2nd replacement        | 0              |
| 8    | year of 2nd replacement         | 0              |
| 9    | odometer at replacement         | 0              |
| 10   | month odometer data begins      | 9              |
| 11   | year odometer data begins       | 75             |
| 12   | odometer reading 1              | 2353           |
| 13   | odometer reading 2              | 6299           |
| 14   | odometer reading 3              | 10479          |

In [7]:
n = 10 # Number of discretization points
omax = 450000 # maximum odometer value

curbus = 0  # Current bus 
output = array(NA, dim = c(nbBuses, nbMonths, 3))  # output matrix
outputdiscr = array(NA, dim = c(nbBuses, nbMonths, 3))  # Discretized output matrix 
transitions = matrix(0, n, n)  # Transition matrix
pi0_x = rep(0, n)

for (busType in 1:length(fileArr)) {
    
    thefile = fileArr[busType]
    nbRows = nbRowsArr[busType]
    nbCols = nbColsArr[busType]
    tmpdata = read.csv(paste0(thepath, "/datafiles/", thefile), sep = "\r", header = FALSE)
    if (dim(tmpdata)[1] != nbRows * nbCols) {
        stop("Unexpected size")
    }
    tmpdata = matrix(as.matrix(tmpdata), nbRows, nbCols)
    
    print(paste0("Group = ", busType, "; Nb at least one = ", length(which(tmpdata[6, 
        ] != 0)), "; Nb no repl = ", length(which(tmpdata[6, ] == 0))))
    
    for (busId in 1:nbCols) {
        curbus = curbus + 1
        # First replacement
        mo1stRepl = tmpdata[4, busId]
        ye1stRepl = tmpdata[5, busId]
        odo1stRep = tmpdata[6, busId]
        
        # Second replacments
        mo2ndRepl = tmpdata[7, busId]
        ye2ndRepl = tmpdata[8, busId]
        odo2ndRep = tmpdata[9, busId]
        
        # First odometer reading
        moDataBegins = tmpdata[10, busId]
        yeDataBegins = tmpdata[11, busId]
        
        # Odometer reading
        odoReadings = tmpdata[12:nbRows, busId]
        wasreplacedonce = ifelse((odoReadings >= odo1stRep) & (odo1stRep > 0), 1, 
            0)
        wasreplacedtwice = ifelse((odoReadings >= odo2ndRep) & (odo2ndRep > 0), 1, 
            0)
        howmanytimesreplaced = wasreplacedonce + wasreplacedtwice
        
        correctedmileage = odoReadings + howmanytimesreplaced * (howmanytimesreplaced - 
            2) * odo1stRep - 0.5 * howmanytimesreplaced * (howmanytimesreplaced - 
            1) * odo1stRep  # Resets odometer to 0 when engine is replaced
        
        
        output[curbus, 1:(nbRows - 12), 1] = howmanytimesreplaced[2:(nbRows - 11)] - 
            howmanytimesreplaced[1:(nbRows - 12)]  # replacement decision
        output[curbus, 1:(nbRows - 12), 2] = correctedmileage[1:(nbRows - 12)]  #corrected odometer readings
        output[curbus, 1:(nbRows - 12), 3] = tmpdata[13:nbRows, busId] - tmpdata[12:(nbRows - 
            1), busId]  # change in odometer readings
        
        outputdiscr[curbus, , 1] = output[curbus, , 1]  # Copy copy across replacement decision
        outputdiscr[curbus, , 2:3] = ceiling(n * output[curbus, , 2:3]/omax)  # Discretize
        
        # Compute transition matrix, conditional on no replacement
        for (t in 1:(nbRows - 13)) {
            # If no replacement
            if (outputdiscr[curbus, t, 1] == FALSE) {
                i = outputdiscr[curbus, t, 2]
                j = outputdiscr[curbus, t + 1, 2]
                transitions[i, j] = transitions[i, j] + 1
                pi0_x[i] = pi0_x[i] + 1
            }
        }
    }
}

[1] "Group = 1; Nb at least one = 0; Nb no repl = 15"
[1] "Group = 2; Nb at least one = 0; Nb no repl = 4"
[1] "Group = 3; Nb at least one = 27; Nb no repl = 21"
[1] "Group = 4; Nb at least one = 32; Nb no repl = 5"
[1] "Group = 5; Nb at least one = 11; Nb no repl = 1"
[1] "Group = 6; Nb at least one = 7; Nb no repl = 3"
[1] "Group = 7; Nb at least one = 18; Nb no repl = 0"
[1] "Group = 8; Nb at least one = 18; Nb no repl = 0"


Recall that
\begin{align*}
U_{x}=\sum_{x^{\prime}}\beta U_{x^{\prime}}P_{x^{\prime}|x0}-\log\pi_{0|x}
\end{align*}
Which can be rewritten as 
\begin{align*}
L = (\beta \Pi - I) U
\end{align*}
And when $(\beta \Pi - I)$ is invertible
\begin{align*}
U=(\beta\Pi-I)^{-1}L
\end{align*}
Thus if we normalize `transitions` this will give us $\Pi_{xx'}$ and `pi_0_x` is $L$. Hence

In [20]:
# Specify beta (model is not well identified unless we specify beta)
beta = .99
trim = 10

# Compute stochastic matrix
Pi = transitions / rowSums(transitions)

# Get rid of NAN's
Pi_trim = Pi[1:trim, 1:trim]

Pi_trim

0,1,2,3,4,5,6,7,8
0.9392312,0.0607688,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0.0,0.941967,0.05803299,0.0,0.0,0.0,0.0,0.0,0.0
0.0,0.0,0.95080321,0.04919679,0.0,0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.93762677,0.06237323,0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0,0.9551318,0.0448682,0.0,0.0,0.0
0.0,0.0,0.0,0.0,0.0,0.968335,0.03166496,0.0,0.0
0.0,0.0,0.0,0.0,0.0,0.0,0.96969697,0.03030303,0.0
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.98630137,0.01369863
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In the data, we have that conditional on $y = 0$, the highest state is an absorbing state, i.e. $Pr(x' = 9 | x = 9) = 1$, because there is no exogenous replacement. This will make estimation impossible, so we will exogenous set $Pr(x' = 9 | x = 9) = .75$ and the remaining mass tranisitions to state 1.

In [21]:
Pi_trim[9,9] = .75
Pi_trim[9,1] = .25

In [23]:
L = pi0_x[1:trim]
L_norm = L / sum(L)

U = solve(beta* Pi_trim - diag(trim)) %*% L_norm

print(U)

            [,1]
 [1,] -13.592099
 [2,] -11.962172
 [3,] -10.275594
 [4,]  -8.328148
 [5,]  -7.564851
 [6,]  -6.613311
 [7,]  -6.657610
 [8,]  -7.931109
 [9,] -13.067333
