bi-grams on polish texts #63
Comments
Looks like that is working just fine!
library(tidyverse)
library(tidytext)
text <- c("Marek jest w galerii sztuki. On stoi przy ścianie i patrzy na obraz.",
"Marek patrzy na obraz od kilku minut. Obraz wisi na ścianie.",
"Obok obrazu wiszą tabliczki z informacją.",
"Jedna tabliczka jest po polsku a druga jest po angielsku.",
"Ściana jest biała a obraz jest kolorowy.",
"Na obrazie są: niebieskie niebo, szare chmury i zielona łąka pokryta kolorowymi plamami.",
"Kolorowe plamy to kwiaty.",
"Kwiaty są wszędzie i mają różne kolory: czerwony, żółty, pomarańczowy, fioletowy i różowy.",
"Pośrodku łąki rośnie drzewo. Drzewo jest duże i stare.",
"Pod drzewem stoją małe dzieci, dziewczynka i chłopiec.",
"Dzieci patrzą na drzewo. Na drzewie siedzi kot. Kot patrzy na dzieci.")
# tibble() replaces the deprecated data_frame(); the result is the same
tibble(text) %>%
  unnest_tokens(word, text, token = "ngrams", n = 2)
#> # A tibble: 105 x 1
#> word
#> <chr>
#> 1 marek jest
#> 2 jest w
#> 3 w galerii
#> 4 galerii sztuki
#> 5 sztuki on
#> 6 on stoi
#> 7 stoi przy
#> 8 przy ścianie
#> 9 ścianie i
#> 10 i patrzy
#> # ... with 95 more rows
Did you have a specific problem?
It works fine when you type the text manually or read it from a *.txt file or from Twitter. But try this:
That's because you are doing |
The order is not the problem. The problem is Polish characters. I used a sample word ("łyżka" = spoon) to show how they look. Here is the code once more:
Look at "dość" in the first table (position 4) and "doĺ›ä‡ trening" in the second table (position 3). "dość" and "doĺ›ä‡" are the same word (the 4th in the whole message); the correct form is "dość" (English: "enough"). The same happens with "końcu" and "który" (positions 8 and 9 in the first table, position 7 in the second). FB post link: https://www.facebook.com/chodakowskaewa/videos/1548694051870966/
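The garbled form quoted above is consistent with UTF-8 bytes being read as Windows-1250 (CP1250) and then lowercased by the tokenizer. The following is a minimal base R sketch of that mechanism, not a confirmed diagnosis of the reporter's setup. The variable names are illustrative, and the encoding name "CP1250" is assumed to be accepted by the local iconv implementation.

```r
# "dość" written with Unicode escapes so the source file encoding does not matter
original <- "do\u015b\u0107"

# Reinterpret the UTF-8 bytes as if they were CP1250 and convert to UTF-8.
# This reproduces the kind of mojibake seen in the second table.
garbled <- iconv(original, from = "CP1250", to = "UTF-8")
print(garbled)   # "doĹ›Ä‡" (lowercasing this gives the "doĺ›ä‡" from the report)

# Reverse the mistake: go back to the CP1250 bytes, then declare them UTF-8.
# This only works BEFORE lowercasing, because lowercasing changes the bytes.
repaired <- iconv(garbled, from = "UTF-8", to = "CP1250")
Encoding(repaired) <- "UTF-8"
print(repaired == original)
```

If this is indeed what happens, the repair has to be applied to the raw messages before unnest_tokens() runs, since the tokenizer lowercases its input by default.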
Hmmm, neither @dgrtwo nor I can reproduce this problem:
library(Rfacebook)
library(tidyverse)
library(tidytext)
fb_page_posts <- getPage("chodakowskaewa", token=token, n=1)
#> 1 posts
fb_page_posts %>%
  unnest_tokens(text, message, token = "words") %>%
  count(text, sort = TRUE)
#> # A tibble: 86 × 2
#> text n
#> <chr> <int>
#> 1 miłość 4
#> 2 nie 4
#> 3 o 4
#> 4 i 3
#> 5 przy 3
#> 6 cię 2
#> 7 łzy 2
#> 8 ma 2
#> 9 miłości 2
#> 10 te 2
#> # ... with 76 more rows
fb_page_posts %>%
  unnest_tokens(text, message, token = "ngrams", n = 2) %>%
  count(text, sort = TRUE)
#> # A tibble: 104 × 2
#> text n
#> <chr> <int>
#> 1 o tym 2
#> 2 a część 1
#> 3 bez miłości 1
#> 4 bo wciąż 1
#> 5 ból prawdziwa 1
#> 6 boli miłość 1
#> 7 bólu bo 1
#> 8 byłyby nie 1
#> 9 celowym zadawaniu 1
#> 10 chwile zapewne 1
#> # ... with 94 more rows
This is definitely some kind of encoding problem specific to something in your setup. I wonder if it is related to the locale you have set in R? This is literally just a guess on my part. It appears not to be a problem with tidytext, the tokenizers package, or the packages either of them wrap, though. It all works on our end, which I know is frustrating for you, unfortunately.
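The locale guess can be checked with base R alone. The sketch below reports the character-type locale and then shows what an unmarked UTF-8 string looks like and how to mark it. The `msg` value is a hypothetical stand-in for a downloaded Facebook message, built from raw bytes so the example does not depend on Rfacebook or a token; whether the reporter's messages were actually in this state is an assumption.

```r
# Which locale is R using for character handling, and is it UTF-8?
ctype <- Sys.getlocale("LC_CTYPE")
is_utf8_locale <- isTRUE(l10n_info()[["UTF-8"]])
print(ctype)
print(is_utf8_locale)

# Hypothetical message: the UTF-8 bytes of "dość" with no encoding mark,
# which is how text can arrive when a package does not declare its encoding.
msg <- rawToChar(as.raw(c(0x64, 0x6f, 0xc5, 0x9b, 0xc4, 0x87)))
print(Encoding(msg))   # "unknown": R will assume the native locale encoding
print(validUTF8(msg))  # TRUE: the bytes themselves are well-formed UTF-8

# Declare the encoding instead of converting; the bytes stay the same.
Encoding(msg) <- "UTF-8"
print(msg == "do\u015b\u0107")
```

On a Windows session with a CP1250 locale, an unmarked string like this would be treated as CP1250, so marking it before tokenizing is one thing worth trying.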
Thanks. I've tried converting the messages from UTF-8 to Latin-2, from Latin-2 to UTF-8, and the same with CP1250 (I'm using Polish Windows), and it didn't help. I'll try setting the locale in R to something else (UTF-8 should be the best choice, right?) and keep fighting the problem. We can close this issue now. Thanks for the help :)
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue. |
Please try using unnest_tokens() to make bi-grams from Facebook posts, especially Polish posts.