bi-grams on polish texts #63

Closed

prokulski opened this issue May 22, 2017 · 7 comments

@prokulski

Please try unnest_tokens() to make bi-grams with posts from Facebook, especially Polish posts.

@juliasilge
Owner

Looks like that is working just fine!

library(tidyverse)
library(tidytext)

text <- c("Marek jest w galerii sztuki. On stoi przy ścianie i patrzy na obraz.",
          "Marek patrzy na obraz od kilku minut. Obraz wisi na ścianie.", 
          "Obok obrazu wiszą tabliczki z informacją.", 
          "Jedna tabliczka jest po polsku a druga jest po angielsku.", 
          "Ściana jest biała a obraz jest kolorowy.", 
          "Na obrazie są: niebieskie niebo, szare chmury i zielona łąka pokryta kolorowymi plamami.",
          "Kolorowe plamy to kwiaty.",
          "Kwiaty są wszędzie i mają różne kolory: czerwony, żółty, pomarańczowy, fioletowy i różowy.",
          "Pośrodku łąki rośnie drzewo. Drzewo jest duże i stare.",
          "Pod drzewem stoją małe dzieci, dziewczynka i chłopiec.", 
          "Dzieci patrzą na drzewo. Na drzewie siedzi kot. Kot patrzy na dzieci.")

data_frame(text) %>%
    unnest_tokens(word, text, token = "ngrams", n = 2)
#> # A tibble: 105 x 1
#>              word
#>             <chr>
#>  1     marek jest
#>  2         jest w
#>  3      w galerii
#>  4 galerii sztuki
#>  5      sztuki on
#>  6        on stoi
#>  7      stoi przy
#>  8   przy ścianie
#>  9      ścianie i
#> 10       i patrzy
#> # ... with 95 more rows

Did you have a specific problem?

@prokulski
Author

It's OK when you write the text manually or take it from a *.txt file or Twitter. But try this:

#devtools::install_github("pablobarbera/Rfacebook/Rfacebook")
library(Rfacebook)
library(tidyverse)
library(tidytext)

# Request an access token via:
# https://developers.facebook.com/tools/explorer/
token <- "EAACEdEose0cBAFwOZA7ib0kNTaL1Uj6DsROTZAgcftvfYTWZBZCWOWf2ToHIjhOMMak4XqPIjJlSFNDGI7l1SMrM05ZBXWZB9Uk5QqosLZAA66MGmwfzXAUw7UvBUHtrb2ZAenS0z3AwBQFJ9wZCPXim2r0wCwYrRem4avvYYUn9i5UMTV3rl8wltZCP0rJIdZCslUZD"

fb_page_posts <- getPage("pisorgpl", token=token, n=10)

fb_page_posts %>%
   unnest_tokens(text, message, token="words") %>%
   count(text) %>%
   arrange(desc(n)) %>%
   head(10)

# 10th is "się"

fb_page_posts %>% 
   unnest_tokens(text, message, token="ngrams", n=2) %>%
   count(text) %>%
   arrange(desc(n)) %>%
   head(10)

# 10th is "buduje się"

# another fb page:
fb_page_posts <- getPage("chodakowskaewa", token=token, n=10)

fb_page_posts %>%
   unnest_tokens(text, message, token="words") %>%
   count(text) %>%
   arrange(desc(n)) %>%
   head(10)

# 7th is "łyżka"

fb_page_posts %>% 
   unnest_tokens(text, message, token="ngrams", n=2) %>%
   count(text) %>%
   arrange(desc(n)) %>%
   head(10)

# 1st - "łyżka 10g"
# 2nd - "2 łyżki"

@dgrtwo
Collaborator

dgrtwo commented May 23, 2017

That's because you are doing count(text) and arrange(desc(n)), which changes the order. There is no reason to expect the resulting n-grams to be consecutive or for the order to be related to the order of the 1-gram tokens. Try select(text) instead of count(text) %>% arrange(desc(n)) and you'll get the output you are expecting.
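
Concretely, with the fb_page_posts data frame from the earlier getPage() call, that looks roughly like this and keeps the bi-grams in document order:

fb_page_posts %>%
   unnest_tokens(text, message, token = "ngrams", n = 2) %>%
   select(text) %>%
   head(10)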

@prokulski
Author

Order is not the problem. The problem is Polish characters; I took sample words ("łyżka" = spoon) to show how they look.

The code one more time:

fb_page_posts <- getPage("chodakowskaewa", token=fb_oauth, n=1)

fb_page_posts %>%
   unnest_tokens(text, message, token="words") %>%
   count(text) %>%
   arrange(desc(n)) %>%
   head(10)

# A tibble: 10 × 2
         text     n
        <chr> <int>
1          bo     1
2          co     1
3    dlaczego     1
4        dość     1
5           i     1
6  inspiracji     1
7        jest     1
8       końcu     1
9       który     1
10         mi     1

fb_page_posts %>% 
   unnest_tokens(text, message, token="ngrams", n=2) %>%
   count(text) %>%
   arrange(desc(n)) %>%
   head(10)

# A tibble: 10 × 2
               text     n
              <chr> <int>
1     bo inspiracji     1
2         co widaä‡     1
3    doĺ›ä‡ trening     1
4        i dlaczego     1
5  inspiracji nigdy     1
6        jest twoim     1
7     koĺ„cu ktăłry     1
8          ktăłry z     1
9          mi wiele     1
10 moich programăłw     1

fb_page_posts$message

[1] "Bo inspiracji nigdy dość ❤️❤️❤️ Trening zawsze sprawia mi wiele radości co widać na końcu \xed��\xed�\u0082\xed��\xed�\u0082\n\nKtóry z moich programów jest Twoim ulubionym? \nI dlaczego? \xed��\xed�\u0080\xed��\xed�\u0080"

Look at "dość" in first table (pos 4) and "doĺ›ä‡ trening" in second table (pos 3). dość and doĺ›ä‡ this is the same word (4th in whole message) - correct should be "dość" (eng. "enough").

Same with "końcu" and "który" (8 and 9 in 1st table, 7th in second).

FB post link: https://www.facebook.com/chodakowskaewa/videos/1548694051870966/
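
For what it's worth, a rough sketch (only my guess, not verified) that reproduces the same garbling by taking the UTF-8 bytes of "dość" and re-reading them as CP1250, then lower-casing the result like the tokenizer does:

x <- enc2utf8("dość")
# take the raw UTF-8 bytes and pretend they are CP1250 text
garbled <- iconv(rawToChar(charToRaw(x)), from = "CP1250", to = "UTF-8")
tolower(garbled)
# prints something close to "doĺ›ä‡"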

@juliasilge
Owner

Hmmm, neither @dgrtwo nor I can reproduce this problem:

library(Rfacebook)
library(tidyverse)
library(tidytext)

fb_page_posts <- getPage("chodakowskaewa", token=token, n=1)
#> 1 posts

fb_page_posts %>%
    unnest_tokens(text, message, token="words") %>%
    count(text, sort = TRUE)

#> # A tibble: 86 × 2
#>       text     n
#>      <chr> <int>
#> 1   miłość     4
#> 2      nie     4
#> 3        o     4
#> 4        i     3
#> 5     przy     3
#> 6      cię     2
#> 7      łzy     2
#> 8       ma     2
#> 9  miłości     2
#> 10      te     2
#> # ... with 76 more rows

fb_page_posts %>% 
    unnest_tokens(text, message, token="ngrams", n=2) %>%
    count(text, sort = TRUE)

#> # A tibble: 104 × 2
#>                 text     n
#>                <chr> <int>
#> 1              o tym     2
#> 2            a część     1
#> 3        bez miłości     1
#> 4           bo wciąż     1
#> 5      ból prawdziwa     1
#> 6        boli miłość     1
#> 7            bólu bo     1
#> 8         byłyby nie     1
#> 9  celowym zadawaniu     1
#> 10    chwile zapewne     1
#> # ... with 94 more rows

This is definitely some kind of encoding problem specific to something in your setup. I wonder if it is related to the locale you have set in R? This is literally just a guess on my part. It appears not to be a problem with tidytext, the tokenizers package, or the packages either of them wrap, though. Looks like it is all working on our end, which I know is frustrating for you, unfortunately.
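
If it helps as a starting point, a quick sketch of what checking and changing the locale could look like (the Windows locale name below is only a guess; names are platform-specific):

# see which locale R is currently using
Sys.getlocale()

# on Polish Windows, something like this may be worth trying
Sys.setlocale("LC_CTYPE", locale = "Polish")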

@prokulski
Author

Thanks. I've tried converting the messages from UTF-8 to latin2, latin2 to UTF-8, and the same with cp1250 (I'm using Polish Windows), and it didn't help. I'll try setting the locale in R to something else (UTF-8 should be the best choice, right?) and keep fighting the problem.
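
One more thing I will try (just a sketch, not verified): instead of converting with iconv(), declare the downloaded messages as UTF-8 before tokenizing, so only the marked encoding changes and the bytes stay the same:

# fb_page_posts comes from the earlier getPage() call
Encoding(fb_page_posts$message) <- "UTF-8"

fb_page_posts %>%
   unnest_tokens(text, message, token = "ngrams", n = 2) %>%
   count(text, sort = TRUE)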

Now we can close this issue. Thanks for the help :)

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions bot locked and limited conversation to collaborators Mar 26, 2022