Task doesn't accept foreign language #2568
Comments
Some general notes:
|
library(mlr)

# Output of dput(sample_data), assigned back so the example is reproducible
sample_data = structure(list(city = c("Сергиев", "Новосибирск", "Красноярск",
"Химки", "Курск", "Москва", "Волжский", "Уфа", "Коломна", "Москва"
), item_cnt_month = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
# Create the dummy features, then build the regression task
dummy_data = createDummyFeatures(sample_data, target = "item_cnt_month")
task = makeRegrTask(data = dummy_data, target = "item_cnt_month")
|
I'm working in "English_United States.1252" R locale on Windows 10.
And lines solved the issue:
The next problem was missing values in the target variable:
And this is not a language-dependent issue. But if I switch to the Russian locale, I can reproduce your issue. Has switching the administrative locale to Russian helped you? If you are working in Russian, it is worth doing. |
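Since the behaviour seems to depend on the active locale, it may help to record what R reports before comparing results. The snippet below is only a diagnostic sketch in base R: it does not touch mlr, the variable names are made up for illustration, and the city name is written with Unicode escapes so the script does not depend on the encoding of the source file.

```r
# Diagnostic sketch (base R only): report the locale and how R handles a
# Cyrillic string in it. All variable names here are illustrative.
current_locale = Sys.getlocale("LC_CTYPE")  # e.g. "English_United States.1252"
locale_info = l10n_info()                   # flags such as MBCS and `UTF-8`

# "Москва" written with Unicode escapes; R marks such strings as UTF-8.
city = "\u041c\u043e\u0441\u043a\u0432\u0430"
declared = Encoding(city)

# Does the string survive a trip through the native encoding of this locale?
round_trips = identical(enc2utf8(enc2native(city)), city)

print(current_locale)
print(locale_info)
print(declared)
print(round_trips)
```

The output differs between machines, so posting it next to the error makes it easier to see whether two people are actually running in the same setup.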
My bad. In the original data
If I try to use
I think the best option is to use your suggestion number 4. |
I summarized the main aspects of the conversation above in this example. I switched my Windows locale to Russian, then used this R setup:
library(mlr)
Sys.setlocale(locale = "Russian")
If certain words, such as "Красноярск" (the city of Krasnoyarsk), are included in column names, as in this example:
dummy_data = structure(list(
item_cnt_month = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
city.Красноярск = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0)
),
class = "data.frame",
row.names = c("1","2", "3", "4", "5", "6", "7", "8", "9", "10"))
task = makeRegrTask(data = dummy_data, target = "item_cnt_month")
Then
If the whole dataset provided by @Hadsga is used except the line with "Красноярск", the code works as expected:
dummy_data = structure(list(
item_cnt_month = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
city.Волжский = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
city.Коломна = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 0),
# city.Красноярск = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0),
city.Курск = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0),
city.Москва = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 1),
city.Новосибирск = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
city.Сергиев = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
city.Уфа = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0),
city.Химки = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0)
),
class = "data.frame",
row.names = c("1","2", "3", "4", "5", "6", "7", "8", "9", "10"))
task = makeRegrTask(data = dummy_data, target = "item_cnt_month")
@pat-s, doesn't it seem like an encoding issue? And doesn't it seem similar to Rapporter/pander#296 (to sum up: after R 3.4.0 was released, some encoding issues appeared and it was impossible to use the package with data in certain languages), which persisted for almost two years until it was solved by adding |
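If the failure really is tied to non-ASCII column names, one possible workaround is to give those columns ASCII placeholder names before creating the task and to keep a lookup table for mapping them back. The helper below is a hypothetical sketch in base R, not part of mlr, and it has not been verified against the error discussed in this thread; `ascii_safe_names` and its `prefix` argument are names invented for this example.

```r
# Hypothetical helper (not part of mlr): replace non-ASCII column names with
# ASCII placeholders and return a lookup table to restore the originals.
ascii_safe_names = function(data, prefix = "feature") {
  original = names(data)
  # A name counts as ASCII only if every code point is at most 127.
  # Names that cannot be decoded as UTF-8 are treated as non-ASCII.
  is_ascii = vapply(
    original,
    function(nm) isTRUE(all(utf8ToInt(enc2utf8(nm)) <= 127L)),
    logical(1),
    USE.NAMES = FALSE
  )
  safe = original
  safe[!is_ascii] = paste0(prefix, seq_len(sum(!is_ascii)))
  safe = make.unique(safe)  # guard against clashes with existing names
  names(data) = safe
  lookup = data.frame(safe = safe, original = original, stringsAsFactors = FALSE)
  list(data = data, lookup = lookup)
}

# Small example with one Cyrillic column name, written with Unicode escapes
# ("city." followed by the Cyrillic spelling of Krasnoyarsk).
dummy_data = data.frame(item_cnt_month = c(1, 1, 1), x = c(0, 0, 1))
names(dummy_data)[2] = "city.\u041a\u0440\u0430\u0441\u043d\u043e\u044f\u0440\u0441\u043a"

res = ascii_safe_names(dummy_data, prefix = "city.")
print(names(res$data))
print(res$lookup)
```

The renamed `res$data` could then be passed to `makeRegrTask()` as in the examples above, with `res$lookup` used afterwards to translate feature names back for reporting.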
@GegznaV Thanks for all your time here. TBH, I don't have much experience with encoding, and dealing with non-Latin characters is out of scope for us here. If there is an easy canonical fix, I am happy to take a look at it. Closing here since the issue is not really related to mlr. |
I have a data set with Russian column names:
If I try to create a task I get this error: