Trouble adding boundary penalties in matrix format #92

Closed
javierfajnolla opened this Issue Jun 22, 2018 · 9 comments

javierfajnolla commented Jun 22, 2018

Hi!

I am having trouble adding boundary penalties to my conservation problem, which uses matrix format (dgCMatrix). The problem is built with the features as a matrix, and it solves normally before adding these penalties. However, after boundary penalties are added, the problem becomes infeasible.

This is how my problem looks when printed before adding the boundary penalties:

> pr
Conservation Problem
  planning units: matrix (55585 units)
  cost:           min: 0, max: 3461.30884
  features:       Abarema_auriculata, Abarema_barbouriana, Abarema_jupunba, ... (13352 features)
  objective:      Minimum set objective 
  targets:        Absolute targets [targets (min: 0, max: 9297.1)]
  decisions:      Binary decision 
  constraints:    <Locked in planning units [7834 locked units]
                   Locked out planning units [556 locked units]>
  penalties:      <none>
  portfolio:      default
  solver:         Gurobi [first_feasible (0), gap (0.01), presolve (2), threads (16), time_limit (2147483647), verbose (1)]

Notice that planning units and features were used in matrix format to build the problem. I am able to solve the problem up to this point.

However, I would like to use add_boundary_penalties(). With a conservation problem built from matrices, I need to provide the data argument to the function.

I built the matrix of boundaries using the boundary_matrix() function on the raster from which I made the matrix of planning units. I noticed that this function does not ignore NA cells when applied to a raster, so I needed to remove rows and columns of the resulting matrix so that only those matching non-NA pixels remain. Like this:

library(prioritizr)
library(raster)
library(tibble)
library(dplyr)

# Make the matrix of boundaries (one row/column per raster cell, NA cells included)
pu_boundary <- boundary_matrix(pu)
# Make a table where I can find which pixels in my pu raster are not NA
pu_values <- tibble(pu_index = seq_len(ncell(pu)),
                    pu = values(pu))
# Find a vector with the indices of the pu cells that are not NA
pu_with_values <- pu_values %>%
  filter(!is.na(pu)) %>%
  pull(pu_index)
# Keep only the rows and columns of the boundary matrix that match those indices
pu_boundary <- pu_boundary[pu_with_values, pu_with_values]

I haven't inspected many of these before, but the matrix looks right to me. This is a fragment:

> pu_boundary %>% head
6 x 55585 sparse Matrix of class "dgCMatrix"
                                                                                    
[1,] 0.3333334 .          .          .          .          .          . . .         
[2,] .         0.25000001 0.08333334 .          .          .          . . .         
[3,] .         0.08333334 0.08333334 0.08333334 .          .          . . 0.08333334
[4,] .         .          0.08333334 0.08333334 0.08333334 .          . . .         
[5,] .         .          .          0.08333334 0.08333334 0.08333334 . . .         
[6,] .         .          .          .          0.08333334 0.16666668 . . .         

However, using this matrix makes the problem infeasible:

> sol <- pr %>%
     add_boundary_penalties(penalty = 1, edge_factor = 1, data = pu_boundary) %>%
     solve()

Optimize a model with 232990 rows, 165404 columns and 31884207 nonzeros
Variable types: 0 continuous, 165404 integer (165404 binary)
Coefficient statistics:
  Matrix range     [1e+00, 4e+00]
  Objective range  [2e-01, 3e+03]
  Bounds range     [1e+00, 1e+00]
  RHS range        [5e-01, 9e+03]
Presolve removed 4 rows and 8378 columns
Presolve time: 3.16s

Explored 0 nodes (0 simplex iterations) in 7.65 seconds
Thread count was 1 (of 20 available processors)

Solution count 0

Model is infeasible
Best objective -, best bound -, gap -
Error in solve(pr) : conservation problem is infeasible

Later, I tried modifying the matrix a bit further, following what the add_boundary_penalties() function does internally:

pu_boundary <- Matrix::forceSymmetric(pu_boundary, uplo = "L")
class(pu_boundary) <- "dgCMatrix"

But the result was exactly the same.

Do you see something wrong in the way I am calculating the boundary matrix? Or is the problem somewhere else?

Any help is much appreciated. And thanks a lot for all the development that has gone into the package, it's been great to use!
Best regards

PS: I had to build the problem using a self-constructed dgCMatrix for the features, because the number of species is so large that the problem had not finished building in raster format after 3 weeks.
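
For reference, a minimal sketch of how the feature matrix could instead be built with prioritizr's rij_matrix() helper (pu and spp_stack are placeholder names for the cost raster and the feature stack; whether this is faster than a hand-rolled loop will depend on the data):

library(prioritizr)
library(raster)

# rij_matrix() computes the amount of each feature in each planning unit and
# returns a sparse dgCMatrix (features as rows, planning units as columns)
rij <- rij_matrix(pu, spp_stack)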

jeffreyhanson commented Jun 23, 2018

Hey,

Thanks for getting in touch. I can't see anything immediately incorrect with the code you've posted, so I'm going to need a bit more information in order to reproduce it and find a fix. Could you please try running your code using the built-in planning unit and feature data (sim_pu_raster and sim_features) and see if this still results in an infeasible problem? Also, if you could post (or email me) the complete script that uses the built-in data that would be very helpful.

Also, the problem function has a skip_checks argument that you can set to TRUE to skip running checks on the data, which can speed up building problems with large data sets.
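
A minimal sketch of the kind of self-contained example requested above, using only the built-in data (the target and penalty values are placeholders):

library(prioritizr)

data(sim_pu_raster, sim_features)

p <- problem(sim_pu_raster, sim_features) %>%
  add_min_set_objective() %>%
  add_relative_targets(0.1) %>%
  add_binary_decisions() %>%
  add_boundary_penalties(penalty = 1, edge_factor = 1,
                         data = boundary_matrix(sim_pu_raster))
s <- solve(p)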

javierfajnolla commented Jun 25, 2018

Hi and thanks a lot for offering the help!

I tried your suggestion and I was able to solve the problem with boundary penalties for the example data, using my code to make the rij_matrix. So I am a bit lost at this point...

Can you think of any case where adding boundary penalties makes a minimum set objective problem infeasible? If I understand it correctly, penalties could make this type of problem more expensive, but not infeasible...

I still think the problem is in my boundary matrix, because the problem can be solved before adding boundary penalties. I have tried converting my planning units from raster to polygons (with rasterToPolygons()) with the same negative result. I thought the problem with a matrix built from a raster planning-unit layer might appear when removing the rows and columns that correspond to pixels at sea (with NA values), but using polygons (where only PUs with data are preserved, so the dimensions are correct right after using boundary_matrix()) resulted in the same infeasible outcome.

The most surprising thing for me was that adding penalties with a penalty of 0 also made the problem infeasible. Shouldn't using a penalty of 0 be equivalent to not using penalties at all?

PS: I used skip_checks = TRUE, but it had not finished after 3 weeks of waiting, so I decided to try building the matrices myself (which took no longer than ~4 hours using a for() loop).

jeffreyhanson commented Jun 26, 2018

No worries. I'm sorry you're having this problem with prioritizr.

Yeah, that's strange. The boundary penalties should never make a feasible problem become infeasible.

What happens when you remove the locked in and locked out constraints but leave in the boundary penalties?

What happens when you run it with 10 species? If you still have the problem with 10 species, would you mind sending me data for the planning units and the first 10 species, so I can track down what the issue is?

Yeah, when you specify a penalty of zero this means that the penalties should not be applied - this is bizarre.

Which version of prioritizr are you running? If you haven't already, could you please try it with the latest version of prioritizr on GitHub?

Ah ok, it looks like we might need to try skipping more things when skip_checks = TRUE, because it shouldn't take that long when the checks are skipped. As an aside, though, have you tried increasing the raster maxmemory and chunksize options in rasterOptions? I've found this can speed things up in general with raster processing.
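
For example, a rough sketch (the values are arbitrary and should be scaled to the available RAM):

library(raster)

# let raster keep more data in memory and process larger chunks at a time
rasterOptions(maxmemory = 1e10, chunksize = 1e9)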

javierfajnolla commented Jun 26, 2018

Great, and thanks again!
Sorry if I am making you lose time... but my hope is that this will be useful for you or someone else at some point.

So, I found out some new things. After more testing, I no longer think the problem is in the boundary penalties. Instead, it seems related to a combination of locked-out constraints, the random "id" given to newly declared conservation problems and, weirdly, something changing in some environment (Gurobi's options?). Does that make any sense?

The combination failing in my tests is: a conservation problem (with a certain random and internally generated "id") with locked-out constraints that is being solved for the second (or 3rd, 4th...) time. It seems to me that the combination also requires that the problem is built using rij_matrix (and that the matrix is large?).

Now, the long explanation: when I run problems for the first time (with or without boundary penalties, but with locked-out constraints), they are always feasible (as I am using targets adjusted to be feasible). As a reminder, it is a minimum set objective with binary decisions, using Gurobi as the solver.

But then, if I try solving the same problem again (simply repeating the same line of code, with no other code run in between), it is infeasible!
If I try a third, a fourth time... it is still infeasible. If I modify the object after solving it the first time (for instance, by adding penalties to it), it is still infeasible (whereas if those penalties had been added before the first solve, the problem would have been feasible). However, if I rebuild or reload the problem (previously saved as an RDS) after solving it a first time, the problem is feasible (as many times as I try... provided that the problem object is reloaded or rebuilt each time). I have checked, and every time I reload or rebuild the conservation problem it gets a different random "id", so that could be connected to the issue (for instance, if I read a problem saved as an RDS twice, all.equal() finds 68 differences between the two objects, all referring to different "id" components inside different parts of the conservation problem object, for instance in problem$solver$parameters).
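
Schematically, the pattern is (p being any of my large matrix-built problems with locked-out constraints):

s1 <- solve(p)   # first call: feasible, solves normally
s2 <- solve(p)   # same call repeated on the same object: "Model is infeasible"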

There are some differences in Gurobi's output between the first solve and the 2nd, 3rd... The first few lines are always the same (description of the model in rows, columns, integer variables, coefficient statistics...). But then they differ: the first-time output continues with "Found heuristic solution: objective XXXX" and then starts a presolve that takes ~120 s and prints 8-9 lines as it removes more rows and columns. In contrast, the output from the second, third... time starts the presolve right after the "Coefficient statistics" lines (with the four "Matrix", "Objective", "Bounds" and "RHS" ranges), so there is no "Found heuristic solution: ..." line, and the presolve prints only one line and ends in ~4 s.
This makes me think that Gurobi is somehow remembering (options? a temporary file?) some presolve data tagged by the "id", but failing when attempting to re-use it.

In summary, it seems to me that when a conservation problem is built (or read from an RDS file), an "id" is randomly assigned to it. When the problem is solved the first time, something in Gurobi's options or temporary files stores precomputed data for that "id", and this later produces the infeasibility error when trying to solve a problem with the same "id".

Is this possible? Sorry, I do not really know the details of how the package communicates with Gurobi.

I have reproduced this problem ~20 times, always with the same result, using slightly different conservation problems. In all my tests I have used the same planning units (for the north of South America, ~55,000 PUs). I found the problem using ~13,000 species as features, using a different table of ~13,000 species (the same species, but future projections of their distributions, so a different table), using a table of ~4,000 different species, and using the ~4,000 future projections of those. In all these tests the first solve is feasible and, from the second solve onwards, infeasible (and R eventually crashes on the 3rd or 4th try). However, and this is very weird, I cannot reproduce the problem with the small example data in the prioritizr package (also built using matrices): it does not become infeasible no matter how many times I solve it. Nonetheless, I am not sure that the problem only affects large conservation problems built with matrices; I do not have raster-built problems that large to test with.

I have run these tests on a Windows computer (32 GB RAM and 20 cores) using R 3.4.4 with Gurobi 8 and the latest version of prioritizr (v4.0.1.5), but I have also repeated some of them on a Mac laptop (8 GB RAM and 4 cores) using R 3.5, Gurobi 8 and the same latest version of prioritizr (v4.0.1.5).

I can continue my work for now by re-reading a saved RDS of the problem before every solve... but it is not ideal (the RDS is ~200 MB and takes time to read), and I am really intrigued about what is going on! Moreover, maybe this will affect me or other people at some point... So, if you have any clue...

Thanks very much!!!
Regards,
Javier

jeffreyhanson commented Jun 27, 2018

Yeah, it's definitely not ideal and sounds like a serious bug. Thank you again for raising this. I can't really debug it without a reproducible example, though. Could you please email me a copy of your script so I can see exactly what you're doing?

Also, could you try running the code below to verify if prioritizr is copying internal objects correctly:

library(testthat)
library(prioritizr)
data(sim_pu_polygons, sim_features)
p1 <- problem(sim_pu_polygons, sim_features, "cost")
p2 <- p1 %>% add_locked_in_constraints("locked_in")
expect_equal(p1$constraints$length(), 0)
expect_equal(p2$constraints$length(), 1)

javierfajnolla commented Jun 27, 2018

Thank you very very much!

I sent you the script to your email.

As per the code you gave me, there is no output from the testthat::expect_equal(), so I guess that it works as you expected: they are equal.

For the record, I have been able to verify that the same large problem (>13,000 species) does not turn infeasible after a first solve when the conservation problem is built from rasters.
I had to use a workaround for the step of building the raster-based conservation problem (which, as I mentioned when I started this thread, takes an extremely long time with the 13,000-layer raster stack). I built a raster problem with a small subset of the species and then replaced all the necessary parts of the conservation problem object with the data for all the species. Please tell me if you think this approach can produce misleading results, as I might continue using those tweaked problems... but they seem fine to me so far.
I tweaked them with code that looks something like this:

# Build the problem with a few of the species
p <- problem(cost_opp, stack_few_sp)  # cost_opp is the raster with cost data for my planning units; stack_few_sp is a stack with 10 of my species

## Tweak the object
# Replace the feature stack of species (currently only 10 species) with the full stack (>13000 sp)
p$data$features[[1]] <- stack_all_sp

# Replace the rij_matrix with the manually calculated rij_matrix
p$data$rij_matrix <- list(cost_opp = m)   # cost_opp is the single zone name; m is my manually built dgCMatrix

# Replace the absolute abundances of the features with manually calculated abundances
p$data$feature_abundances_in_total_units <- feat_ab  # feat_ab is a matrix with 1 column, named "cost_opp", and 1 row per feature; values are total abundances and row.names are feature names, as in the raster stack layers

# Change the attribute with the names of the species
attributes(p$data$features)$feature_names <- feat_names  # character vector with all feature names, matching the raster stack layer names

## Continue by adding more things and solving the problem

So, the main point is that the issue, if there is one, does not seem to affect raster-problems.

Thanks again, and I look forward to hearing from you!

jeffreyhanson commented Jun 28, 2018

Thank you very much for your help @javierfajnolla and for persisting with this. I really appreciate it.

Yeah, the code I posted was testing if constraints were getting added correctly to problem objects, and they were being added correctly, so it didn't throw an error.

Thanks to you, I was able to work out that the issue was due to a bug in add_manual_locked_constraints (which also affects add_locked_in_constraints and add_locked_out_constraints). Specifically, if you attempted to solve the same problem object multiple times with locked in or locked out constraints added to it, the code would lock in or out an incorrect set of planning units. This means that if you did not run the solve function on the same problem object multiple times, the code worked correctly and the correct planning units were locked in or out. This bug likely affects all versions from 4.0.0 through 4.0.1.5. To ensure that it doesn't rear its head again, I've also added additional unit tests to the new development version (4.0.1.6).

@javierfajnolla, if you're interested in why you were getting infeasible problems on the second solve attempt: this bug caused the wrong planning units to be locked out when you tried solving a problem a second time, and because some of those planning units were needed to meet the targets, the problem became infeasible.

Thank you again @javierfajnolla, if we catch up sometime, remind me to buy you a beer!

javierfajnolla commented Jul 3, 2018

Awesome, I really like beer!

But there is really no need; it is repaid by all the work you have all put into prioritizr. I am happy to help as part of the community.

Thanks for the explanation, the errors make sense now. I am still intrigued about where the locked planning units were changing, because the conservation problem objects did not seem to change when I solved them the first time... Was it in some files that the package stores somewhere?

Anyway, thanks for the update!

jeffreyhanson commented Jul 4, 2018

No worries.

The prioritizr package doesn't save data to files on disk when building or solving problems; it's all stored in the computer's memory (which is why you can save problem objects using the save or saveRDS functions; the exception is large/file-backed raster data sets, which the raster package will store on disk unless manually requested to load all the data into R).

The specific lines in the source code causing the bug were 242--244 in R/add_manual_locked_constraints (see here for the diff: 0807865#diff-f3eee206eec8d70d5f3a7bc903bbcc21). Specifically, when prioritizr compiled the problem it sent the locked in/out data to a C++ function (rcpp_apply_locked_constraints; https://github.com/prioritizr/prioritizr/blob/master/src/rcpp_apply_locked_constraints.cpp) which has some code that subtracts one from the zone and planning unit ids (lines 11--12) because C++ uses base-zero indexing. However, because Rcpp lets you access the same objects as the R interpreter, this caused the data stored in the problem to also have one subtracted from the planning unit ids (resulting in incorrect planning units being locked in/out) and one subtracted from the zone ids (occasionally resulting in a segfault that crashes R). Since we only use the locked data once when compiling the problem, we get the correct solution when solving the problem once, but if we try solving the problem again, we would be using planning unit and zone ids that have already had one subtracted from them, and the C++ code would subtract one again.

To fix this, I wrapped the planning unit ids and zone ids in c() to create a copy of the data stored in the problem (lines 242--244 in R/add_manual_locked_constraints), which means the data stored in the problem object is unchanged when we subtract one from the copied data in the C++ code (lines 11--12 in https://github.com/prioritizr/prioritizr/blob/master/src/rcpp_apply_locked_constraints.cpp).
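
For anyone unfamiliar with this Rcpp behaviour, here is a minimal standalone sketch of the pitfall and of the c() fix; it is unrelated to prioritizr's actual source, subtract_one is a made-up name, and compiling it requires Rcpp plus a C++ toolchain:

library(Rcpp)

# An Rcpp function that subtracts one from an integer vector. Because Rcpp
# wraps the underlying R object rather than copying it, the modification is
# visible in R after the call.
cppFunction("
void subtract_one(IntegerVector x) {
  for (int i = 0; i < x.size(); ++i)
    x[i] = x[i] - 1;
}")

ids <- c(3L, 5L, 7L)
subtract_one(ids)     # ids is now 2 4 6
subtract_one(ids)     # ids is now 1 3 5 -- the repeated-solve bug pattern

ids <- c(3L, 5L, 7L)
subtract_one(c(ids))  # wrapping in c() forces a copy, so ids is still 3 5 7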
