bi-grams on polish texts #63

Closed

prokulski opened this issue May 22, 2017 · 7 comments

@prokulski

Please try unnest_tokens() to make bi-grams with posts from Facebook, especially Polish posts.

@juliasilge
Owner

Looks like that is working just fine!

library(tidyverse)
library(tidytext)

text <- c("Marek jest w galerii sztuki. On stoi przy ścianie i patrzy na obraz.",
          "Marek patrzy na obraz od kilku minut. Obraz wisi na ścianie.", 
          "Obok obrazu wiszą tabliczki z informacją.", 
          "Jedna tabliczka jest po polsku a druga jest po angielsku.", 
          "Ściana jest biała a obraz jest kolorowy.", 
          "Na obrazie są: niebieskie niebo, szare chmury i zielona łąka pokryta kolorowymi plamami.",
          "Kolorowe plamy to kwiaty.",
          "Kwiaty są wszędzie i mają różne kolory: czerwony, żółty, pomarańczowy, fioletowy i różowy.",
          "Pośrodku łąki rośnie drzewo. Drzewo jest duże i stare.",
          "Pod drzewem stoją małe dzieci, dziewczynka i chłopiec.", 
          "Dzieci patrzą na drzewo. Na drzewie siedzi kot. Kot patrzy na dzieci.")

data_frame(text) %>%
    unnest_tokens(word, text, token = "ngrams", n = 2)
#> # A tibble: 105 x 1
#>              word
#>             <chr>
#>  1     marek jest
#>  2         jest w
#>  3      w galerii
#>  4 galerii sztuki
#>  5      sztuki on
#>  6        on stoi
#>  7      stoi przy
#>  8   przy ścianie
#>  9      ścianie i
#> 10       i patrzy
#> # ... with 95 more rows

Did you have a specific problem?

@prokulski
Author

It's OK when you write the text manually or take it from a *.txt file or Twitter. But try this:

#devtools::install_github("pablobarbera/Rfacebook/Rfacebook")
library(Rfacebook)
library(tidyverse)
library(tidytext)

# Request an access token via:
# https://developers.facebook.com/tools/explorer/
token <- "EAACEdEose0cBAFwOZA7ib0kNTaL1Uj6DsROTZAgcftvfYTWZBZCWOWf2ToHIjhOMMak4XqPIjJlSFNDGI7l1SMrM05ZBXWZB9Uk5QqosLZAA66MGmwfzXAUw7UvBUHtrb2ZAenS0z3AwBQFJ9wZCPXim2r0wCwYrRem4avvYYUn9i5UMTV3rl8wltZCP0rJIdZCslUZD"

fb_page_posts <- getPage("pisorgpl", token=token, n=10)

fb_page_posts %>%
   unnest_tokens(text, message, token="words") %>%
   count(text) %>%
   arrange(desc(n)) %>%
   head(10)

# 10th is "się"

fb_page_posts %>% 
   unnest_tokens(text, message, token="ngrams", n=2) %>%
   count(text) %>%
   arrange(desc(n)) %>%
   head(10)

# 10th is "buduje się"

# another fb page:
fb_page_posts <- getPage("chodakowskaewa", token=token, n=10)

fb_page_posts %>%
   unnest_tokens(text, message, token="words") %>%
   count(text) %>%
   arrange(desc(n)) %>%
   head(10)

# 7th is "łyżka"

fb_page_posts %>% 
   unnest_tokens(text, message, token="ngrams", n=2) %>%
   count(text) %>%
   arrange(desc(n)) %>%
   head(10)

# 1st - "łyżka 10g"
# 2nd - "2 łyżki"

@dgrtwo
Collaborator

dgrtwo commented May 23, 2017

That's because you are doing count(text) and arrange(desc(n)), which changes the order. There is no reason to expect the resulting n-grams to be consecutive or for the order to be related to the order of the 1-gram tokens. Try select(text) instead of count(text) %>% arrange(desc(n)) and you'll get the output you are expecting.
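
Concretely, with the fb_page_posts data frame from the earlier getPage() call, that looks roughly like this and keeps the bi-grams in document order:

fb_page_posts %>%
   unnest_tokens(text, message, token = "ngrams", n = 2) %>%
   select(text) %>%
   head(10)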

@prokulski
Author

Order is not the problem. The problem is Polish characters; I took sample words ("łyżka" = spoon) to show how they look.

The code one more time:

fb_page_posts <- getPage("chodakowskaewa", token=fb_oauth, n=1)

fb_page_posts %>%
   unnest_tokens(text, message, token="words") %>%
   count(text) %>%
   arrange(desc(n)) %>%
   head(10)

# A tibble: 10 × 2
         text     n
        <chr> <int>
1          bo     1
2          co     1
3    dlaczego     1
4        dość     1
5           i     1
6  inspiracji     1
7        jest     1
8       końcu     1
9       który     1
10         mi     1

fb_page_posts %>% 
   unnest_tokens(text, message, token="ngrams", n=2) %>%
   count(text) %>%
   arrange(desc(n)) %>%
   head(10)

# A tibble: 10 × 2
               text     n
              <chr> <int>
1     bo inspiracji     1
2         co widaä‡     1
3    doĺ›ä‡ trening     1
4        i dlaczego     1
5  inspiracji nigdy     1
6        jest twoim     1
7     koĺ„cu ktăłry     1
8          ktăłry z     1
9          mi wiele     1
10 moich programăłw     1

fb_page_posts$message

[1] "Bo inspiracji nigdy dość ❤️❤️❤️ Trening zawsze sprawia mi wiele radości co widać na końcu \xed��\xed�\u0082\xed��\xed�\u0082\n\nKtóry z moich programów jest Twoim ulubionym? \nI dlaczego? \xed��\xed�\u0080\xed��\xed�\u0080"

Look at "dość" in first table (pos 4) and "doĺ›ä‡ trening" in second table (pos 3). dość and doĺ›ä‡ this is the same word (4th in whole message) - correct should be "dość" (eng. "enough").

Same with "końcu" and "który" (8 and 9 in 1st table, 7th in second).

FB post link: https://www.facebook.com/chodakowskaewa/videos/1548694051870966/
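
For what it's worth, a rough sketch (only my guess, not verified) that reproduces the same garbling by taking the UTF-8 bytes of "dość" and re-reading them as CP1250, then lower-casing the result like the tokenizer does:

x <- enc2utf8("dość")
# take the raw UTF-8 bytes and pretend they are CP1250 text
garbled <- iconv(rawToChar(charToRaw(x)), from = "CP1250", to = "UTF-8")
tolower(garbled)
# prints something close to "doĺ›ä‡"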

@juliasilge
Owner

Hmmm, neither @dgrtwo nor I can reproduce this problem:

library(Rfacebook)
library(tidyverse)
library(tidytext)

fb_page_posts <- getPage("chodakowskaewa", token=token, n=1)
#> 1 posts

fb_page_posts %>%
    unnest_tokens(text, message, token="words") %>%
    count(text, sort = TRUE)

#> # A tibble: 86 × 2
#>       text     n
#>      <chr> <int>
#> 1   miłość     4
#> 2      nie     4
#> 3        o     4
#> 4        i     3
#> 5     przy     3
#> 6      cię     2
#> 7      łzy     2
#> 8       ma     2
#> 9  miłości     2
#> 10      te     2
#> # ... with 76 more rows

fb_page_posts %>% 
    unnest_tokens(text, message, token="ngrams", n=2) %>%
    count(text, sort = TRUE)

#> # A tibble: 104 × 2
#>                 text     n
#>                <chr> <int>
#> 1              o tym     2
#> 2            a część     1
#> 3        bez miłości     1
#> 4           bo wciąż     1
#> 5      ból prawdziwa     1
#> 6        boli miłość     1
#> 7            bólu bo     1
#> 8         byłyby nie     1
#> 9  celowym zadawaniu     1
#> 10    chwile zapewne     1
#> # ... with 94 more rows

This is definitely some kind of encoding problem specific to something in your setup. I wonder if it is related to the locale you have set in R? This is literally just a guess on my part. It appears not to be a problem with tidytext, the tokenizers package, or the packages either of them wrap, though. Looks like it is all working on our end, which I know is frustrating for you, unfortunately.
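
If it helps as a starting point, a quick sketch of what checking and changing the locale could look like (the Windows locale name below is only a guess; names are platform-specific):

# see which locale R is currently using
Sys.getlocale()

# on Polish Windows, something like this may be worth trying
Sys.setlocale("LC_CTYPE", locale = "Polish")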

@prokulski
Author

Thanks. I've tried converting the messages from UTF-8 to latin2, latin2 to UTF-8, and the same with cp1250 (I'm using Polish Windows), and it didn't help. I'll try setting the locale in R to something else (UTF-8 should be the best choice, right?) and keep fighting the problem.
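
One more thing I will try (just a sketch, not verified): instead of converting with iconv(), declare the downloaded messages as UTF-8 before tokenizing, so only the marked encoding changes and the bytes stay the same:

# fb_page_posts comes from the earlier getPage() call
Encoding(fb_page_posts$message) <- "UTF-8"

fb_page_posts %>%
   unnest_tokens(text, message, token = "ngrams", n = 2) %>%
   count(text, sort = TRUE)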

Now we can close this issue. Thanks for the help :)

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions bot locked and limited conversation to collaborators Mar 26, 2022