modify .cluster file data format #21

Merged
merged 2 commits into from
Jul 1, 2020

Contributor

@who3411 who3411 commented Jun 27, 2020

Purpose

Fixes #20.

How to Fix (see commit b2667d7 for details)

To fix this issue, the cluster numbers assigned to unique messages need to be mapped onto all messages. The unique messages are available as the variable uniqueClasses inside the duplicateRemover function in prisma.R. In pulsar/core/cluster_generator.R, names(data$remapper) relates to uniqueClasses (data$remapper itself relates to all messages).

My proposed correction procedure is as follows (please see the PR for more information):

  1. Use names(data$remapper) and uniqueClasses to map unique messages to all messages (the result is called lines).
  2. Use lines and clusters to map the cluster numbers of unique messages to the cluster numbers of all messages (the result is called lineClusters).
  3. Write lineClusters to the .cluster file.

uniqueClasses is a local variable in the duplicateRemover function, so it is recalculated in pulsar/core/cluster_generator.R.
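
As a rough illustration of steps 1 and 2, here is a minimal sketch on toy data. The names allClasses and uniqueLines and all values are hypothetical stand-ins (for names(data$remapper) and colnames(data$data) respectively); this is not the code from the commit. Step 3 is left out because the .cluster file layout is not shown here.

```r
# Hypothetical toy data: one class string per message, in original order
# (stand-in for names(data$remapper)).
allClasses <- c("a b", "c", "a b", "d", "c")

# Unique class strings in first-seen order (stand-in for uniqueClasses).
uniqueClasses <- unique(allClasses)

# Line name of the representative message for each unique class
# (stand-in for colnames(data$data)).
uniqueLines <- c("line1", "line2", "line4")

# Cluster number of each unique message (stand-in for the clustering result).
clusters <- c(line1 = 2, line2 = 1, line4 = 2)

# Step 1: for every message, find the line name of its unique representative.
lines <- uniqueLines[match(allClasses, uniqueClasses)]

# Step 2: look up the cluster of that representative for every message,
# then rename the entries so they follow the original message order.
lineClusters <- unname(clusters[lines])
names(lineClusters) <- paste0("line", seq_along(lineClusters))

print(lineClusters)
```

In this toy case, messages 1 and 3 share a class string, so both receive the cluster number of the representative line1.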

Result

> #capture_dir = "models/itunes-xbmc/itunes-xbmc"
…
> clusters = calcDatacluster(pmf)
> clusters
  line2  line98 line167 line563 line787 line273 line487 line451 line173 line181 
      1       6      10       5       4       3       1       4      10      10 
line569 line577 line793 line801 line493 line501 line457 line471 line465 line177 
      5       5       7       7       1       7       7       7       7      10 
line171 line169 line573 line567 line565 line797 line791 line497 line461 line277 
     10      10       5       5       5       7       2       1       5       3 
line491 line455 line789 line275 line185 line583 line805 line507 line469  line21 
      1       2       4       3      10       5       2       1       2       9 
line677 line291 line489 line453   line1 line165 line785 line271 line449 line485 
      2       2       1       4       4       4       4       4       4       4 
line561 line183 line289 line467 line503 line579  line97   line3 line671 line653 
      4       4       4       4       4       4       4       4       4       4 
  line5  line11  line19  line15   line9   line7  line17 line179 line575 line799 
      9       9       9       9       9       9       9      10       5       7 
line499 line463 line285 line667 line279 line287 line283 line655 line661 line669 
      1       5       3       8       3       3       3       8       8       8 
line777 line665 line659 line657  line10   line6   line8 
      8       8       8       8       6       6       6 
> 
…
> lines = sapply(names(data$remapper), function(x) colnames(data$data)[match(x, uniqueClasses)])
> lineClusters = sapply(lines, function(x) clusters[match(x, names(clusters))])
> names(lineClusters) = paste("line", 1:length(lineClusters), sep="")
> lineClusters
  line1   line2   line3   line4   line5   line6   line7   line8   line9  line10 
      4       1       4       1       9       6       9       6       9       6 
 line11  line12  line13  line14  line15  line16  line17  line18  line19  line20 
      9       6       9       6       9       6       9       6       9       6 
 line21  line22  line23  line24  line25  line26  line27  line28  line29  line30 
      9       6       9       6       9       6       9       6       9       6 
 line31  line32  line33  line34  line35  line36  line37  line38  line39  line40 
      9       6       9       6       9       6       9       6       9       6 
…
line791 line792 line793 line794 line795 line796 line797 line798 line799 line800 
      2       6       7       6       7       6       7       6       7       6 
line801 line802 line803 line804 line805 line806 line807 line808 line809 line810 
      7       6       7       6       2       6       7       6       2       6 
line811 line812 line813 line814 line815 line816 line817 line818 line819 line820 
      7       6       7       6       7       6       7       6       2       6 
line821 line822 
      2       1 
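
A mapping like lineClusters can be sanity-checked in two ways: no message should be left without a cluster (match() returns NA when a lookup fails), and messages that share a class string should share a cluster number. The sketch below does this on hypothetical toy values; allClasses is an assumed stand-in for names(data$remapper), and none of this is part of the PR itself.

```r
# Hypothetical toy inputs: class string and mapped cluster for each message.
allClasses   <- c("a b", "c", "a b", "d", "c")
lineClusters <- c(line1 = 2, line2 = 1, line3 = 2, line4 = 2, line5 = 1)

# Check 1: every message received a cluster number (no failed match()).
noMissing <- !anyNA(lineClusters)

# Check 2: count the distinct cluster numbers seen for each class string;
# a consistent mapping yields exactly one per class.
clustersPerClass <- tapply(lineClusters, allClasses,
                           function(x) length(unique(x)))
consistent <- all(clustersPerClass == 1)

print(c(noMissing = noMissing, consistent = consistent))
```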

Development

Successfully merging this pull request may close these issues.

.cluster file need to be modified to relate the message with the cluster number
2 participants