.cluster file need to be modified to relate the message with the cluster number #20

Closed
who3411 opened this issue Jun 27, 2020 · 7 comments · Fixed by #21 or #22

Comments

@who3411
Contributor

who3411 commented Jun 27, 2020

I am trying to understand the implementation of PRISMA and PULSAR for my research. The handling of clusterAssignments in pulsar.core.data.DataHandler assumes that the .cluster file gives a cluster number for every message (one per line), but in fact it does not. As a result, the model for itunes-xbmc does not seem to be built correctly.

For instance, the .cluster file format expected by pulsar.core.data.DataHandler is:

Cluster number belonging to message 1 (line 1)
Cluster number belonging to message 2 (line 2)
Cluster number belonging to message 3 (line 3)
…
Cluster number belonging to message n (line n) (= last message)

However, the current .cluster file format is:

Cluster number belonging to message ? (line ?) (= 1st unique message's cluster number)
Cluster number belonging to message ? (line ?) (= 2nd unique message's cluster number)
Cluster number belonging to message ? (line ?) (= 3rd unique message's cluster number)
…
Cluster number belonging to message ? (line ?) (= last unique message's cluster number)

As a test, I printed the variable clusters in pulsar/core/cluster_generator.R (its contents are what gets written to the .cluster file). The result is as follows:

> #capture_dir = "models/itunes-xbmc/itunes-xbmc"
…
> clusters = calcDatacluster(pmf)
> clusters
  line2  line98 line167 line563 line787 line273 line487 line451 line173 line181 
      1       6      10       5       4       3       1       4      10      10 
line569 line577 line793 line801 line493 line501 line457 line471 line465 line177 
      5       5       7       7       1       7       7       7       7      10 
line171 line169 line573 line567 line565 line797 line791 line497 line461 line277 
     10      10       5       5       5       7       2       1       5       3 
line491 line455 line789 line275 line185 line583 line805 line507 line469  line21 
      1       2       4       3      10       5       2       1       2       9 
line677 line291 line489 line453   line1 line165 line785 line271 line449 line485 
      2       2       1       4       4       4       4       4       4       4 
line561 line183 line289 line467 line503 line579  line97   line3 line671 line653 
      4       4       4       4       4       4       4       4       4       4 
  line5  line11  line19  line15   line9   line7  line17 line179 line575 line799 
      9       9       9       9       9       9       9      10       5       7 
line499 line463 line285 line667 line279 line287 line283 line655 line661 line669 
      1       5       3       8       3       3       3       8       8       8 
line777 line665 line659 line657  line10   line6   line8 
      8       8       8       8       6       6       6 

With the current .cluster file format, many messages are not associated with any cluster number. To fix this, the cluster numbers of the unique messages need to be mapped back onto all messages. The unique messages are available as the variable uniqueClasses in the function duplicateRemover in prisma.R. In pulsar/core/cluster_generator.R, names(data$remapper) corresponds to uniqueClasses, and data$remapper relates them to all messages.

My proposed fix is as follows (please see the PR I am sending for details → #21):

  1. Use names(data$remapper) and uniqueClasses to map unique messages to all messages (the result is called lines).
  2. Use uniqueClasses and clusters to map the cluster numbers of the unique messages to all messages (the result is called lineClusters).
  3. Write lineClusters to the .cluster file.
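
The three steps above can be sketched in Python as follows. This is only an illustration under assumptions: `line_to_unique` is a hypothetical per-line index into the unique messages, and `unique_clusters` is a hypothetical list of cluster numbers for those unique messages. Neither name comes from PRISMA or PULSAR, and the toy values are made up.

```python
def expand_cluster_labels(line_to_unique, unique_clusters):
    """Return one cluster number per original message line.

    line_to_unique[i]  -- index of the unique message that line i duplicates
    unique_clusters[j] -- cluster number assigned to unique message j
    Both structures are hypothetical stand-ins for the R variables
    discussed in this issue.
    """
    return [unique_clusters[u] for u in line_to_unique]


# Toy data: 6 captured messages, only 3 of them distinct.
line_to_unique = [0, 1, 0, 2, 1, 0]
unique_clusters = [4, 9, 6]

line_clusters = expand_cluster_labels(line_to_unique, unique_clusters)
print(line_clusters)  # [4, 9, 4, 6, 9, 4]
```

Writing `line_clusters` one value per line would then give a .cluster file with as many lines as there are messages.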

Thank you for taking the time to read this.

I am not a native speaker, so some of my wording may be inaccurate. Sorry for the inconvenience.

@hgascon
Owner

hgascon commented Jun 30, 2020

Hi @who3411, thanks for your interest and PR.
Are you saying that some messages are not assigned to a cluster? Why is this a problem?

@who3411
Contributor Author

who3411 commented Jul 1, 2020

@hgascon Thank you for your reply.

> Are you saying that some messages are not assigned to a cluster?

Yes. In PRISMA, all messages are assigned to a cluster. However, in the variable clusters in cluster_generator.R, only the unique messages are assigned to a cluster, so some messages have no assignment.

> Why is this a problem?

A None cluster appears. In my environment, itunes-xbmc.pcap contains 822 messages, but the .cluster file has fewer than 822 lines, so many messages are mapped to a None cluster.

pulsar.core.DataHandler.clusterAssignments needs a cluster number for every message, but the current implementation does not provide one. clusterAssignments is populated in pulsar.core.DataHandler._readClusterAssignments:

def _readClusterAssignments(self):
    path = "%s.cluster" % self.datapath
    if not os.path.exists(path):
        print "Error during clustering (not enough data?)"
        print "Cluster file not generated:", path
        print "Exiting learning module..."
        sys.exit(1)

    def clusterProcessor(clusterRow):
        return clusterRow[0]
    self.clusterAssignments = self._processData(path, clusterProcessor,
                                                self.N, skipFirstLine=False)
    assert(len(self.clusterAssignments) == self.N)
    self.Ncluster = len(set(self.clusterAssignments))

pulsar.core.DataHandler._processData maps line numbers to the cluster numbers in the .cluster file. At this point the .cluster file should have 822 lines, but it currently has fewer, so a None cluster appears. This happens because of the following part of the implementation of pulsar.core.DataHandler._processData:

def _processData(self, fname, process, init, skipFirstLine=False):
    f = file(fname, "r")
    data = csv.reader(f, delimiter="\t", quotechar=None, escapechar=None)
    if init is None:
        res = []
    else:
        res = [None] * init
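
The effect can be reproduced with a simplified stand-in for this reader. The sketch below is not the actual `_processData`. It only assumes, as the excerpt above suggests, that the result list is preallocated with `None` entries and then filled row by row. The function `fill_assignments` and the toy rows are hypothetical.

```python
def fill_assignments(rows, n_messages):
    """Preallocate n_messages slots and fill them from the given rows.

    Simplified stand-in for the reader discussed above (an assumption,
    not the PULSAR code): slots with no matching row keep the value None.
    """
    result = [None] * n_messages
    for index, row in enumerate(rows):
        result[index] = row[0]
    return result


# 5 messages were captured, but the file only has 3 rows (unique messages).
rows = [["4"], ["9"], ["6"]]
assignments = fill_assignments(rows, 5)
print(assignments)            # ['4', '9', '6', None, None]
print(len(set(assignments)))  # 4: three real clusters plus a spurious None
```

In this toy case the list still has the expected length, so a length check alone would not catch the problem. The leftover `None` entries are simply counted as one more cluster.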

@who3411
Contributor Author

who3411 commented Jul 1, 2020

Sorry, I have a supplement to my earlier comment.

> Why is this a problem?

A None cluster appears, and an incorrect Markov model is built. As I mentioned before, the None cluster appears because the .cluster file currently has fewer than 822 lines. In reality, no None cluster exists. In addition, the .cluster file does not map a cluster number to every message. For these reasons, an incorrect Markov model is built.
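
To illustrate why this distorts a Markov model, the sketch below counts first-order transitions between consecutive cluster labels. It is a toy example under assumptions: PULSAR's actual model building is not shown in this issue, and `transition_counts` and both label sequences are hypothetical.

```python
from collections import Counter


def transition_counts(labels):
    """Count first-order transitions between consecutive labels."""
    return Counter(zip(labels, labels[1:]))


truncated = [4, 9, 6, None, None]  # short .cluster file, remaining slots None
expanded = [4, 9, 4, 6, 9]         # one label per message (toy values)

print(transition_counts(truncated))
print(transition_counts(expanded))
```

With the truncated labels, transitions into and out of a `None` state are counted even though no such state exists in the traffic. With one label per message, only real clusters appear.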

@hgascon
Owner

hgascon commented Jul 5, 2020

It seems that your data has many duplicates, which for efficiency reasons
are filtered out (see duplicateRemover). If you want to have a
one-to-one correspondence later, you need to "explode" the labels back
to the original size, which is already done by the public method
getMatrixFactorizationLabels. The method calcDatacluster is private and
is therefore not listed in the documentation of the Prisma package.

@who3411
Contributor Author

who3411 commented Jul 6, 2020

Thank you for the valuable information.
I overlooked the public method getMatrixFactorizationLabels. Sorry, I am not familiar with GitHub: should I send a new PR that uses getMatrixFactorizationLabels?

@hgascon
Owner

hgascon commented Jul 7, 2020

Yes, please do.

@who3411
Contributor Author

who3411 commented Jul 8, 2020

I have sent a new PR, #22, which uses getMatrixFactorizationLabels. I am sorry to trouble you, but I would really appreciate it if you could take a look.
