Task doesn´t accept foreign language #2568

Closed
Hadsga opened this issue Apr 15, 2019 · 6 comments

Hadsga commented Apr 15, 2019

I have a data set with Russian column names:

`glimpse(dat)
Observations: 1,545,898
Variables: 43
$ year                  <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201...
$ month                 <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
$ shop_id               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ item_category_id      <int> 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
$ item_id               <int> 16255, 5740, 5570, 5572, 5573, 5574, 5576, 56...
$ item_cnt_month        <dbl> 1.000000, 1.000000, 1.000000, 1.571429, 1.000...
$ item_cnt_month_lag    <dbl> NA, NA, NA, 1.666667, 1.000000, NA, 1.000000,...
$ item_price_lag        <dbl> NA, NA, NA, 1322, 560, NA, 2231, 2381, NA, 29...
$ item_cnt_month_lag2   <dbl> NA, NA, 1.177778, 1.177778, 1.177778, 1.17777...
$ item_price_lag2       <dbl> NA, NA, 1938.6889, 1938.6889, 1938.6889, 1938...
$ item_cnt_month_lag3   <dbl> 1.163781, 1.163781, 1.163781, 1.163781, 1.163...
$ item_price_lag3       <dbl> 531.262, 531.262, 531.262, 531.262, 531.262, ...
$ city.Адыгея           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Балашиха         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Волжский         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Вологда          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Воронеж          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Выездная         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Жуковский        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Интернет.магазин <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Казань           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Калуга           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Коломна          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Красноярск       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Курск            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Москва           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Мытищи           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Н.Новгород       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Новосибирск      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Омск             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.РостовНаДону     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Самара           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Сергиев          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.СПб              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Сургут           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Томск            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Тюмень           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Уфа              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Химки            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Цифровой         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Чехов            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Якутск           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ city.Ярославль        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...`

If I try to create a task I get this error:

Error in makeTask(type = type, data = data, weights = weights, blocking = blocking, : Assertion on 'data' failed: Columns must be named according to R's variable naming conventions and may not contain special characters.
Contributor

GegznaV commented Apr 15, 2019

Some general notes:

  1. Is your R locale set to Russian? If you are on Windows, you can use Sys.setlocale(locale = "Russian").
  2. If you are on Windows, is your Windows administrative locale set to Russian? (Changing the locale sometimes solves problems like this.)
  3. Could you provide us with a minimal reproducible example of R code (please follow these guidelines)?
  4. You may try to transliterate your Cyrillic names into Latin ones, e.g., this example.
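
For point 4, here is a minimal sketch of transliterating column names before creating the task. It assumes the stringi package is installed. `to_ascii_names()` is a hypothetical helper (not part of mlr), the transform ID "Any-Latin; Latin-ASCII" is only one possible choice, and the two-city data frame is made up for illustration:

```r
library(stringi)

# Hypothetical helper (not part of mlr): turn arbitrary column names into
# ASCII-only, syntactically valid R names.
to_ascii_names <- function(x) {
  # Transliterate to Latin script, then strip diacritics where ICU can.
  x <- stri_trans_general(x, "Any-Latin; Latin-ASCII")
  # Replace anything still outside a conservative ASCII set with a dot.
  x <- gsub("[^A-Za-z0-9._]", ".", x, perl = TRUE)
  # Let base R enforce validity and uniqueness.
  make.names(x, unique = TRUE)
}

# Made-up data in the shape produced by createDummyFeatures().
dummy_data <- data.frame(
  item_cnt_month = c(1, 1, 1),
  a = c(0, 1, 0),
  b = c(0, 0, 1)
)
names(dummy_data) <- c("item_cnt_month", "city.Красноярск", "city.Тюмень")

names(dummy_data) <- to_ascii_names(names(dummy_data))
print(names(dummy_data))
```

The exact Latin spelling depends on the transliteration rules chosen, so it is worth keeping a lookup table from the new names back to the original ones.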

Author

Hadsga commented Apr 15, 2019

  1. I use Sys.setlocale("LC_ALL","Russian").
  2. I am not located in Russia, so I don't want to change the region. However, for a short period, it's no problem.
  3. A minimal example:

dput(sample_data)
structure(list(city = c("Сергиев", "Новосибирск", "Красноярск", 
"Химки", "Курск", "Москва", "Волжский", "Уфа", "Коломна", "Москва"
), item_cnt_month = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -10L))

dummy_data = createDummyFeatures(sample_data, target = "item_cnt_month")
task = makeRegrTask(data = dummy_data, target = "item_cnt_month")

  4. That works, thanks.

Contributor

GegznaV commented Apr 15, 2019

I'm working in the "English_United States.1252" R locale on Windows 10.
With your example, I got this error:

Error in (function (cn, x)  : 
  Unsupported feature type (character) in column 'city'.

These lines solved the issue:

sample_data$city <- as.factor(sample_data$city)   # convert to factor
sample_data <- as.data.frame(sample_data)         # convert to pure data frame

The next problem was missing values in the target variable:

Error in mlr::makeRegrTask(data = dummy_data, target = "item_cnt_month") : 
  Assertion on 'item_cnt_month' failed: Contains missing values (element 7).

And this is not a language-dependent issue.

But if I switch to the Russian locale, I can reproduce your issue. Has switching the administrative locale to Russian helped you? If you are working in the Russian language, it is worth doing.

Hadsga changed the title from "Tasks don´t accept foreign language" to "Tasks doesn´t accept foreign language" on Apr 15, 2019
Hadsga changed the title from "Tasks doesn´t accept foreign language" to "Task doesn´t accept foreign language" on Apr 15, 2019

Author

Hadsga commented Apr 16, 2019

My bad. In the original data, city is converted into a factor and there are no NAs in the target variable.
However, if I change the administrative locale, the column city looks like this (without calling Sys.setlocale(locale = "Russian")):

<U+0422><U+044E><U+043C><U+0435><U+043D><U+044C>

If I try to use Sys.setlocale(locale = "Russian") (i.e. without "LC_ALL") I get this error:

Error in Sys.setlocale("Russian") : invalid 'category' argument

I think the best option is to use your suggestion number 4.
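
A note on the `invalid 'category' argument` error above: the signature is `Sys.setlocale(category = "LC_ALL", locale = "")`, so a single unnamed string is matched to `category`, not `locale`. The sketch below shows the difference. It uses the "C" locale only because that name exists on every platform (an assumption made so the snippet runs anywhere); on Windows, "Russian" would go in its place:

```r
# Positional call: "Russian" is matched to `category`, which only accepts
# names such as "LC_ALL" or "LC_CTYPE", so R raises an error.
positional <- tryCatch(
  Sys.setlocale("Russian"),
  error = function(e) conditionMessage(e)
)
print(positional)

# Named call: `category` keeps its default, "LC_ALL", and the string is
# used as the locale name. On success a non-empty string is returned.
# Note that this changes the locale for the rest of the session.
named <- Sys.setlocale(locale = "C")
print(named)
```

So `Sys.setlocale(locale = "Russian")` and `Sys.setlocale("LC_ALL", "Russian")` are equivalent calls, while `Sys.setlocale("Russian")` is not.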

Contributor

GegznaV commented Apr 16, 2019

I have summarized the main points of the conversation above in this example. I switched my Windows locale to Russian and then used this R setup:

library(mlr)
Sys.setlocale(locale = "Russian")

If certain words, such as "Красноярск" (Krasnoyarsk city), are included as column names, as in this example:

dummy_data = structure(list(
  item_cnt_month   = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 
  city.Красноярск  = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0)
), 
class = "data.frame", 
row.names = c("1","2", "3", "4", "5", "6", "7", "8", "9", "10"))

task = makeRegrTask(data = dummy_data, target = "item_cnt_month")

Then makeRegrTask() fails with this error:

Error in makeTask(type = type, data = data, weights = weights, blocking = blocking, :
Assertion on 'data' failed: Columns must be named according to R's variable naming
conventions and may not contain special characters.

If the whole dataset provided by @Hadsga is used except for the "Красноярск" column, the code works as expected:

dummy_data = structure(list(
  item_cnt_month   = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 
  city.Волжский    = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
  city.Коломна     = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 0), 
  # city.Красноярск  = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0),
  city.Курск       = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0), 
  city.Москва      = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 1),
  city.Новосибирск = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
  city.Сергиев     = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  city.Уфа         = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0), 
  city.Химки       = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0)
  ), 
  class = "data.frame", 
  row.names = c("1","2", "3", "4", "5", "6", "7", "8", "9", "10"))

task = makeRegrTask(data = dummy_data, target = "item_cnt_month")

@pat-s, doesn't it seem like an encoding issue?

And doesn't it seem similar to Rapporter/pander#296? To sum up: after R 3.4.0 was released, some encoding issues appeared that made it impossible to use the package with data in certain languages. The problem persisted for almost two years until it was solved by adding enc2native() in certain lines of code (see Rapporter/pander@06c2f65 for details). These are just my ideas.
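
To narrow down which names trip the check, a small diagnostic sketch may help. It assumes the assertion amounts to comparing each column name with `make.names()` (I have not verified this against the mlr source), and `find_bad_names()` is a hypothetical helper. Because `make.names()` decides what counts as a letter based on the current locale, the result for Cyrillic names can differ between sessions:

```r
# Hypothetical helper: return the names that make.names() would alter
# in the current locale, after converting them to the native encoding.
find_bad_names <- function(x) {
  x <- enc2native(x)
  x[make.names(x) != x]
}

cols <- c("item_cnt_month", "city.Красноярск", "city name", "1st")

print(Sys.getlocale("LC_CTYPE"))  # locale that decides what a letter is
print(Encoding(cols))             # declared encoding of each name
bad <- find_bad_names(cols)
print(bad)
```

"city name" and "1st" are invalid in any locale; whether "city.Красноярск" is reported depends on the session's locale and encoding, which is exactly what this is meant to reveal.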

Member

pat-s commented Apr 24, 2019

@GegznaV Thanks for all your time here. TBH, I do not have much experience with encoding, and dealing with non-Latin characters is out of scope for us here.

If there is an easy canonical fix I am happy to take a look at it.

Closing here since the issue is not really related to mlr.

pat-s closed this as completed on Apr 24, 2019