Finish Hypercane GUI script for sample action #41

Closed
shawnmjones opened this issue Aug 10, 2021 · 3 comments

Comments

@shawnmjones
Member

The existing CLI application must be reworked. This work was started already and needs to be tested.

Once that work is done, we can add the corresponding GUI script for the Wooey interface.

@shawnmjones
Member Author

This work cannot truly be completed until other work is done, because many of the algorithms run by sample depend on the identify (#44), score (#42), order (#43), cluster (#45), and filter (#47) actions.

@shawnmjones
Member Author

shawnmjones commented Sep 10, 2021

At this point, sample supports the following (not completely tested) algorithms out of the box:

# hc sample --help
usage: hc sample [-h] {DSA1,DSA2,DSA3,DSA4,filtered-random,order-by-memento-datetime-then-systematically-sample,simple-search-engine,true-random,systematic,stratified-random,stratified-systematic,random-cluster,random-oversample,random-undersample} ...

'sample' produces a list of exemplars from a collection by applying an existing algorithm

positional arguments:
  {DSA1,DSA2,DSA3,DSA4,filtered-random,order-by-memento-datetime-then-systematically-sample,simple-search-engine,true-random,systematic,stratified-random,stratified-systematic,random-cluster,random-oversample,random-undersample}
                        sampling methods
    DSA1                An implementation of the algorithm from AlNoamany's dissertation.
    DSA2                An implementation of the DSA2 algorithm from Jones' dissertation.
    DSA3                An implementation of the DSA3 algorithm from Jones' dissertation.
    DSA4                An implementation of the DSA4 algorithm from Jones' dissertation.
    filtered-random     Filter the collection for off-topic mementos and exclude near duplicates before randomly sampling from remainder.
    order-by-memento-datetime-then-systematically-sample
                        Select exemplars from a web archive collection by first ordering a collection, then systematically sampling every jth memento from the remainder.
    simple-search-engine
                        Search for mementos with a specific pattern, score results by BM25, order by descending score.
    true-random         sample probabilistically by randomly sampling k mementos from the input
    systematic          returns every jth memento from the input
    stratified-random   returns j items randomly chosen from each cluster, requires that the input be clustered with the cluster action
    stratified-systematic
                        returns every jth URI-M from each cluster, requires that the input be clustered with the cluster action
    random-cluster      return j randomly selected clusters from the sample, requires that the input be clustered with the cluster action
    random-oversample   randomly duplicates URI-Ms in the smaller clusters until they match the size of the largest cluster, requires input be clustered with the cluster action
    random-undersample  randomly chooses URI-Ms from the larger clusters until they match the size of the smallest cluster, requires input be clustered with the cluster action

optional arguments:
  -h, --help            show this help message and exit
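
For readers unfamiliar with the simpler methods in the help text above, here is a rough Python sketch of three of them (systematic, stratified-random, and random-undersample) as described by their help strings. This is not Hypercane's actual code. The function names, the `(cluster_id, urim)` input shape, the `seed` parameter, and the choice to start systematic sampling at the jth item are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def systematic_sample(urims, j):
    """Return every jth item from the input (assumed to start at the jth item)."""
    if j < 1:
        raise ValueError("j must be a positive integer")
    return urims[j - 1::j]


def _group_by_cluster(clustered):
    """Group (cluster_id, urim) pairs into a dict of cluster_id -> list of URI-Ms."""
    clusters = defaultdict(list)
    for cluster_id, urim in clustered:
        clusters[cluster_id].append(urim)
    return clusters


def stratified_random_sample(clustered, j, seed=None):
    """Return up to j randomly chosen URI-Ms from each cluster."""
    rng = random.Random(seed)
    clusters = _group_by_cluster(clustered)
    sample = []
    for cluster_id in sorted(clusters):
        members = clusters[cluster_id]
        # A cluster smaller than j contributes all of its members.
        sample.extend(rng.sample(members, min(j, len(members))))
    return sample


def random_undersample(clustered, seed=None):
    """Randomly choose URI-Ms from each cluster until all match the smallest cluster."""
    rng = random.Random(seed)
    clusters = _group_by_cluster(clustered)
    if not clusters:
        return []
    smallest = min(len(members) for members in clusters.values())
    sample = []
    for cluster_id in sorted(clusters):
        sample.extend(rng.sample(clusters[cluster_id], smallest))
    return sample
```

The stratified and undersampling functions only make sense on clustered input, which mirrors the "requires that the input be clustered with the cluster action" notes in the help text.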

The arguments for these all appear in Wooey, so it looks like sample works properly in the GUI as well.
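
As background on why the arguments show up there: Wooey builds its web forms from a script's `argparse` definition. The sketch below shows how a subcommand layout like the help output above could be declared with `argparse` subparsers. It is illustrative only and is not the Hypercane source. The `-k` and `-j` flags, their defaults, and the `build_parser` name are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with one subcommand per sampling method (illustrative subset)."""
    parser = argparse.ArgumentParser(
        prog="hc sample",
        description="'sample' produces a list of exemplars from a collection "
                    "by applying an existing algorithm",
    )
    subparsers = parser.add_subparsers(
        dest="method", required=True, help="sampling methods"
    )

    # Hypothetical flag: -k is the number of mementos to draw.
    true_random = subparsers.add_parser(
        "true-random",
        help="sample probabilistically by randomly sampling k mementos from the input",
    )
    true_random.add_argument("-k", dest="k", type=int, default=10,
                             help="number of mementos to sample")

    # Hypothetical flag: -j is the sampling interval.
    systematic = subparsers.add_parser(
        "systematic", help="returns every jth memento from the input"
    )
    systematic.add_argument("-j", dest="j", type=int, default=2,
                            help="sampling interval")

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Parsing `["systematic", "-j", "5"]` with this parser yields `method="systematic"` and `j=5`, and each subparser's arguments are what a form generator would have to render per method.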

I developed a method of annotating BASH scripts with some JSON so that Hypercane is aware of the arguments supported by the BASH script. This seems to have worked well. I will not implement any more algorithms until after we have tested more with NLA.
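
The annotation format itself is not shown in this thread, so the following is only a guess at how such a scheme could work: the script carries a JSON document in a marked comment block, and Python reads it back to learn the supported arguments. The marker strings, the JSON keys, and the `extract_annotation` function are all hypothetical and are not Hypercane's real format.

```python
import json

# Hypothetical Bash script with a JSON annotation embedded in comments.
SCRIPT = '''#!/bin/bash
# HYPOTHETICAL-ARGS-BEGIN
# {
#   "description": "example sampling script",
#   "arguments": [
#     {"flag": "--k", "type": "int", "default": 10, "help": "items to sample"}
#   ]
# }
# HYPOTHETICAL-ARGS-END
echo "sampling"
'''


def extract_annotation(script_text,
                       begin="# HYPOTHETICAL-ARGS-BEGIN",
                       end="# HYPOTHETICAL-ARGS-END"):
    """Return the JSON annotation found between the markers, or None if absent."""
    json_lines = []
    inside = False
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped == begin:
            inside = True
            continue
        if stripped == end:
            break
        if inside:
            # Drop the leading comment character so the rest parses as JSON.
            json_lines.append(stripped.lstrip("#"))
    if not json_lines:
        return None
    return json.loads("\n".join(json_lines))


if __name__ == "__main__":
    annotation = extract_annotation(SCRIPT)
    for argument in annotation["arguments"]:
        print(argument["flag"], argument["type"], argument["default"])
```

Keeping the metadata inside comments means the script still runs unchanged under Bash, while the calling tool can turn each entry into a command-line argument.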

@shawnmjones
Member Author

This works now that caching is enabled. Closing.

@shawnmjones shawnmjones moved this from In progress to In Review in IIPC 2021 Grant - Dark and Stormy Archives Sep 18, 2021
shawnmjones added a commit that referenced this issue Sep 28, 2021