mRMRe::mrmr differs from praznik::mrmr #2604

Closed
pat-s opened this issue Jun 19, 2019 · 2 comments

Comments

pat-s (Member) commented Jun 19, 2019

cross-ref https://notabug.org/mbq/praznik/issues/2

I don't think we can do anything about this but it is important to know.

Note that the absolute values are not of interest here; only the ranking of the features matters.

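suppressPackageStartupMessages(library(mlr))
library(magrittr)  # provides the %>% pipe

# drop the factor feature "chas" so both filters see numeric features only
bh.task = dropFeatures(bh.task, "chas")

# filter scores from mRMRe ("mrmr") and from praznik ("praznik_MRMR")
fv_mrmr = generateFilterValuesData(bh.task, "mrmr")
fv_mrmr_praznik = generateFilterValuesData(bh.task, "praznik_MRMR")

# sort each result by descending score so the rankings can be compared
purrr::map(list(fv_mrmr$data, fv_mrmr_praznik$data), ~ 
    dplyr::arrange(.x, method, dplyr::desc(value))) %>% 
  print()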
#> [[1]]
#> # A tibble: 12 x 4
#>    name    type    method    value
#>    <chr>   <chr>   <chr>     <dbl>
#>  1 lstat   numeric mrmr    0.393  
#>  2 rm      numeric mrmr    0.0940 
#>  3 ptratio numeric mrmr    0.0776 
#>  4 b       numeric mrmr    0.0269 
#>  5 indus   numeric mrmr    0.0189 
#>  6 crim    numeric mrmr    0.0106 
#>  7 zn      numeric mrmr   -0.00239
#>  8 tax     numeric mrmr   -0.0295 
#>  9 age     numeric mrmr   -0.0495 
#> 10 nox     numeric mrmr   -0.0911 
#> 11 rad     numeric mrmr   -0.135  
#> 12 dis     numeric mrmr   -0.160  
#> 
#> [[2]]
#> # A tibble: 12 x 4
#>    name    type    method        value
#>    <chr>   <chr>   <chr>         <dbl>
#>  1 lstat   numeric praznik_MRMR 1     
#>  2 ptratio numeric praznik_MRMR 0.917 
#>  3 rm      numeric praznik_MRMR 0.833 
#>  4 crim    numeric praznik_MRMR 0.75  
#>  5 age     numeric praznik_MRMR 0.667 
#>  6 b       numeric praznik_MRMR 0.583 
#>  7 nox     numeric praznik_MRMR 0.5   
#>  8 zn      numeric praznik_MRMR 0.417 
#>  9 tax     numeric praznik_MRMR 0.333 
#> 10 rad     numeric praznik_MRMR 0.25  
#> 11 dis     numeric praznik_MRMR 0.167 
#> 12 indus   numeric praznik_MRMR 0.0833

Created on 2019-06-19 by the reprex package (v0.3.0)
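
Since only the ranking matters, the disagreement between the two outputs above can be quantified directly. The following is a minimal sketch in Python (standard library only, chosen so it runs without either R package); the two feature orderings are copied from the tables above, while the helper names `spearman_rho` and `top_k_overlap` are made up for this illustration.

```python
# Feature orderings copied from the two result tables (best feature first).
mrmre_order = ["lstat", "rm", "ptratio", "b", "indus", "crim",
               "zn", "tax", "age", "nox", "rad", "dis"]
praznik_order = ["lstat", "ptratio", "rm", "crim", "age", "b",
                 "nox", "zn", "tax", "rad", "dis", "indus"]


def spearman_rho(order_a, order_b):
    """Spearman rank correlation between two tie-free orderings of the same items."""
    if sorted(order_a) != sorted(order_b):
        raise ValueError("both orderings must contain exactly the same items")
    n = len(order_a)
    rank_b = {name: i for i, name in enumerate(order_b)}
    # Sum of squared rank differences over all features.
    d2 = sum((i - rank_b[name]) ** 2 for i, name in enumerate(order_a))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def top_k_overlap(order_a, order_b, k):
    """Features that appear in the top k of both orderings, sorted by name."""
    return sorted(set(order_a[:k]) & set(order_b[:k]))


rho = spearman_rho(mrmre_order, praznik_order)
print(f"Spearman rho: {rho:.3f}")  # prints 0.692
print("shared top 3:", top_k_overlap(mrmre_order, praznik_order, 3))
```

The top three features are the same set in both rankings, so the disagreement is concentrated further down the list; the largest single shift is indus, which is ranked 5th by mRMRe and last by praznik.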

In addition, here is a runtime comparison of the two filters on a dataset with ~7k features:

build_times(fv_nri_praznik_mrmr, fv_nri_mrmr)
# A tibble: 2 x 4
  target              elapsed             user                 system        
  <chr>               <S4: Duration>      <S4: Duration>       <S4: Duration>
1 fv_nri_mrmr         6487s (~1.8 hours)  6495s (~1.8 hours)   0.111s        
2 fv_nri_praznik_mrmr 144s (~2.4 minutes) 290s (~4.83 minutes) 0.144s   

mbq commented Jun 19, 2019

AFAICT, mRMRe estimates mutual information from the correlation between features, using the formula I(x,y) = −0.5 log(1 − cor(x,y)^2) (*), which is why it expects numerical input. praznik instead uses the direct MLE formula I(x,y) = Σ p_xy log(p_xy / (p_x p_y)), summed over all value pairs of categorical variables, which is why it expects factors. So these are basically different algorithms, or, let's say, versions of mRMR using different interfaces to the data.

EDIT: (*) is 2.8 from here, and it holds only for normal x, y and bivariate normal xy.
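
To make the difference concrete, here is a minimal Python sketch of the two mutual information estimators described above. It is not the code of either package: the function names, the synthetic data, and the equal-width binning used to turn numeric values into categories are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def gaussian_mi(x, y):
    """Correlation-based MI in nats: -0.5 * log(1 - r^2), exact only for bivariate normal data."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation
    return -0.5 * math.log(1 - r * r)


def plugin_mi(x, y):
    """Plug-in (MLE) MI in nats for categorical data: sum of p_xy * log(p_xy / (p_x * p_y))."""
    n = len(x)
    cxy, cx, cy = Counter(zip(x, y)), Counter(x), Counter(y)
    # With counts c, p_xy / (p_x * p_y) simplifies to c_xy * n / (c_x * c_y).
    return sum((c / n) * math.log(c * n / (cx[a] * cy[b]))
               for (a, b), c in cxy.items())


def equal_width_bins(values, bins=10):
    """Hypothetical discretization step: map numeric values to integer bin labels."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against constant input
    return [min(int((v - lo) / width), bins - 1) for v in values]


rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(5000)]
# Bivariate normal pair with correlation 0.8: the Gaussian formula applies.
y_lin = [0.8 * a + 0.6 * rng.gauss(0, 1) for a in x]
# Strong but non-linear dependence: correlation is close to zero.
y_sq = [a * a + 0.1 * rng.gauss(0, 1) for a in x]

for label, y in (("linear", y_lin), ("quadratic", y_sq)):
    mi_cor = gaussian_mi(x, y)
    mi_cat = plugin_mi(equal_width_bins(x), equal_width_bins(y))
    print(f"{label:9s} correlation-based: {mi_cor:.3f}  plug-in on bins: {mi_cat:.3f}")
```

In the quadratic case the correlation-based estimate is close to zero while the plug-in estimate on the binned values is clearly positive, so the two approaches can score the same feature very differently once the normality assumption fails.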

pat-s (Member, Author) commented Jul 13, 2019

@mbq Thanks for the clarification!

pat-s closed this as completed Jul 13, 2019