Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sped up mendel errors with sampleKidRole #68

Merged
merged 20 commits into from
Nov 17, 2015
Merged

sped up mendel errors with sampleKidRole #68

merged 20 commits into from
Nov 17, 2015

Conversation

jbloom22
Copy link
Contributor

@jbloom22 jbloom22 commented Nov 4, 2015

on profile225, decreased from 3m12s to 1m52s

@cseed
Copy link
Collaborator

cseed commented Nov 7, 2015

Timing?

@jbloom22
Copy link
Contributor Author

jbloom22 commented Nov 7, 2015

First test-passing version with Array[Array[GenotypeType]] drops time to 1m37s, not as much as expected. I'll look more closely to see if there are obvious inefficiencies, and may try unboxing GenotypeType to Int. If you look, let me know if you see something obvious.

Output looks good. But there are only 6 complete trios in profile225, so after filtering out samples not in complete trios in .filterSamples(isTrioSample) , there is very little data to process.

Here is the contents of profile225.fmendel:

FID PAT MAT CHLD    N
VN049   HG02026 HG02025 1   2009
SH074   HG00656 HG00657 1   5669
m009    NA19679 NA19678 1   3953
m008    NA19661 NA19660 1   5240
Y117    NA19239 NA19238 1   6499
PR05    HG00731 HG00732 1   1506

@cseed
Copy link
Collaborator

cseed commented Nov 7, 2015

Hmm, that's disappointing. The filterSamples seems unnecessary now: if the sample doesn't appear in a trio the list of roles will be empty. Can you see how it does without that?

@cseed
Copy link
Collaborator

cseed commented Nov 8, 2015

I created a "nothing" commands that counts the number of genotypes by doing a aggregateByVariants followed by an aggregate:

val n = state.vds.aggregateByVariantWithKeys[Int](0)(
      (acc, v, s, g) => acc + 1,
      _ + _)
    .aggregate(0)({ case (acc, (v, t)) => acc + t }, _ + _)

This takes around 1m24s (+/- 1s) on my machine on profile225 and mendelerrors takes about 1m24s (+/-). So we're at the theoretical minimum. I'll review it for style and clarity (but maybe not until Monday).

We should get a dataset with more trios.

@@ -10,6 +10,8 @@ class MendelErrorsSuite extends SparkSuite {
val ped = Pedigree.read("src/test/resources/sample_mendel.fam", vds.sampleIds)
val men = MendelErrors(vds, ped)

men.mendelErrors.collect().foreach(println)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@jbloom22
Copy link
Contributor Author

I've done some extensive remodeling of Pedigree and MendelErrors, shorter and conceptually cleaner now, got to delete a bunch of code. But I'm having a serialization issue, which may be related to changing MendelError to include the CompleteTrio rather than the sample. For example, if I replace "implicatedSample" by pasting the body in the closure instead, then the serialization error at that point goes away. but there are a bunch of other ones from the toLine below.

org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1622)
    at org.apache.spark.rdd.RDD.map(RDD.scala:286)
    at org.broadinstitute.hail.methods.MendelErrors.writeMendel(MendelErrors.scala:143)
    at org.broadinstitute.hail.methods.MendelErrorsSuite.test(MendelErrorsSuite.scala:50)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:85)
    at org.testng.internal.Invoker.invokeMethod(Invoker.java:696)
    at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:882)
    at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1189)
    at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:124)
    at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:108)
    at org.testng.TestRunner.privateRun(TestRunner.java:767)
    at org.testng.TestRunner.run(TestRunner.java:617)
    at org.testng.SuiteRunner.runTest(SuiteRunner.java:348)
    at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:343)
    at org.testng.SuiteRunner.privateRun(SuiteRunner.java:305)
    at org.testng.SuiteRunner.run(SuiteRunner.java:254)
    at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
    at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
    at org.testng.TestNG.runSuitesSequentially(TestNG.java:1224)
    at org.testng.TestNG.runSuitesLocally(TestNG.java:1149)
    at org.testng.TestNG.run(TestNG.java:1057)
    at org.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:72)
    at org.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:122)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:240)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:150)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
    at org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:58)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:39)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
    ... 33 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
    at java.io.ObjectStreamClass$FieldReflector.getObjFieldValues(ObjectStreamClass.java:2050)
    at java.io.ObjectStreamClass.getObjFieldValues(ObjectStreamClass.java:1252)
    ... 49 more

@jbloom22
Copy link
Contributor Author

Fixed serialization issues. Current runtime on profile225 is 1m35s.


object FamSummaryCommand extends Command {
def name = "famsummary"
def description = "Summarize a .fam file"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just "FamSummary" is fine here, I think.

@cseed
Copy link
Collaborator

cseed commented Nov 12, 2015

That's it. Back to you.

@jbloom22
Copy link
Contributor Author

The reason you didn't see those others fixed before is that I had mistakenly not committed all changes (that's why the test failed too). I am using Array.groupBy, repushed, back to you.

@jbloom22
Copy link
Contributor Author

Back to you Cotton!

if (code == 2 || code == 1) List(trio.kid, trio.dad, trio.mom)
else if (code == 6 || code == 3) List(trio.kid, trio.dad)
else if (code == 4 || code == 7 || code == 9 || code == 10) List(trio.kid, trio.mom)
else List(trio.kid)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should return an iterator and each case can return Iterator(trio.kid, ...).

@cseed
Copy link
Collaborator

cseed commented Nov 16, 2015

Some more minor stuff. Otherwise, ready to merge in.

@jbloom22
Copy link
Contributor Author

Made all changes. Thanks!

@cseed
Copy link
Collaborator

cseed commented Nov 17, 2015

I don't see a bunch of the changes. Push problem? Also, there is now both FamSummary and FamSummaryCmd files.

cseed added a commit that referenced this pull request Nov 17, 2015
sped up mendel errors with sampleKidRole
@cseed cseed merged commit 67500bc into master Nov 17, 2015
@cseed cseed deleted the jb_mendel_v2 branch November 17, 2015 19:12
danking added a commit to danking/hail that referenced this pull request Sep 24, 2018
* repos might be duplicated, so use a set

* update deployment

* only log when there is something to forget
danking added a commit that referenced this pull request Sep 24, 2018
* initial commit

* approved check seems to work

* add environment

* readme with helpful tips

* more developer info

* better error message when oauth-token is missing

* more dev hints

* abstract over repo

* add git ignore

* make amenable to packaging

* add design doc

* bunch of changes

* fixes

* note about secret

* add test

* document envs

* update design doc

* a bunch of fixes

* notes about service account iam

* update hashes to facilitate a test

* pull batch server url into a const

* fake pr build script

* setup for secrets

* fix pr-build-script

* document new batch volumes

* some halfway working setup

* wip

* refactor to state manipulate + heal model

* a variety of bugs and a couple todos

* i think this actually sets the metadata?

* bunch of fixes, introduce polling thread

* locks and fix github state check

* no threading, restore popped status

* done?

* real build script

* fix message about all prs being tested or running

* first attempt at dockerfile

* env variables, move secrets, fix invalidated prs logic

* get docker file working mostly

* fix pr-build-script

* two little fixes

* update dockerignore

* expose port 5000

* expose to the world

* be resilient to bad or out of date job ids

* reduce cpu request

* first pass on deployment

* add info about external ips

* fix pycache ignore

* fix test report url

* refresh from batch, catch cancel fails, slow down refresh interval

* logging

* fix dockerignore

* give user ownership of current directory as well

* a little clean up and better logic for refreshing batch state

* fix bad variable reference in job cancellation during github state refresh

* do not blow away review state on push

* note which target we are talking about when all prs are tested or running

* be resilient to jobs not existing

* more logging in test_pr

* convert to a format string

* start publishing bundles

* Revert "start publishing bundles"

This reverts commit 63b2036.

* watched repos is customizable

* fixes

* shit in an attempt to use docker but ugh

* note firewalls rules thing

* fix url to repo RE also ignore unwatched repos

* a few fixes

* ignore pycache

* add makefile

* ignore more stuff

* new tests

* fix bad variable reference

* note buildable repos

* work to generalize from just the hail repo

* fix buildable repos

* better debugging into when test fails

* fix tests

* fix build script, TARGET_BRANCH might be ambiguous

* be robust to missing artifacts

* properly log exception

* use copy, drop job_id when cancelled job

* quietly throw away orphan heads

* tag images with git sha

* cancel jobs on push, fix copy, and give merge a message

* fix build script oop

* add test with master change

* add test for approved prs

* add dependency on pytest

* clear job id when setting state to pending

* fix deployment.yaml env var

* need to re-fetch origin as well

* better assertion failure messages for dict subset

* add description of github-tokens directory

* ignore temporary git directory created by tests

* ignore the testing oauth tokens

* ignore gcloud token as well

* sort .gitignore

* also ignore .cache and .pytest_cache

* many improvements and fixes to tests

* god, SHUT up!

* fix new token system

* fixes and add failing test for mergedness

* clean up get/post logic

* lots of fixes

* self. on every field

* add more sophisticated matching

* merged_and_old wrong identation

* fix a couple bugs in new match stuff

* fix match again

* add json_response to *_repo

* add json_response

* ad dlog, fix gc syntax errors

* work towards fixing push test since branch is protected now

* more info on testing

* better error messages and fix bug

* bunch of fixes

* add fixme

* fixes

* fix

* mor elogging around github state refresh

* a couple fixes to the tests

* more logging, less logging, use VERSION_FOR_TYPE

VERSION_FOR_TYPE is used to version the pod type attribute so other
CI instances dont stomp on my pods

* add gc field to Status.to_json

* clean up PRs for targets with no PRs in GH

* better logging, fix dead targets logic

* fix confusing double definition

* fix syntax

* rely on finally so we dont have so many errors

* fix make file for docker tag

* add auto merge flag and dont blow away merged statuses after push

* remove nesting level in artifacts copy

* fix misspelled variable name

* try to hide git output from log

* actually sleep inbetween polls

* warn is deprecated

* remove VERSION_FOR_TYPE

* update deployment for latest CI image

* stop spamming the log with job logs

* update deployment

* do not hide git stdout and stderr

* configure username and email before merge

* update deployment

* be resilient to merge conflicts in the docker_image

* update deployment

* bugssss

* update deployment

* use merge instead of rebase to avoid issues

* update dployment

* fix deployment

I accidentally used an old commit hash

* also update docker_image on force_retest

* update deployment

* more status logging

* update deployment

* fix missing parens

* update deployment

* more logging and fix reading status in GH refresh

* update deployment

* also check for file not found error

* update deployment

* do not allow max statuses to break batch_state_refresh

* update deployment

* avoid failing when 422 is received

* update deployment

* kind of handle the too many statuses situation?

* update CONTEXT to avoid status limits

* update deployment

* add timeouts to all requests

* add assertion about attributes and target_url

* update deployment

* implement pagination

* update deployment

* do not log link headers, longer timeouts on self check

* update deployment.yaml

* be robust to changes in job attribute schema

* update deployment

* add hail-is/batch to watched repos (#25)

* add healthcheck

* add livenessprobe

* fix bug in forgetting of old PRs

* update deployment

* fix port

* build should fail if artifacts exist but are not copied (#32)

* Cleanup (#35)

* abstraciton!

* wip

wip

remove old ci

remove old tests

rename newci to ci

remove unused funciton

add heal

fixes

rename http

wip

take oldest PR, I think

i think no more caching

there are two different github urls

fixes

fixxxxxes

non global creds setting

fixes

fixes

two tiny fixes

bunch of fixes

run formatter

run formatter

formatting

fix PR.from_json

formatting, PEP8

formatting and fixes

fixes

lots of fixes

never return none

relax and strengthen assertions

fixes

thread json_response throug http_helper

get oauth tokens right

fix user

fix token

fixes

passes tests

naming

deploying

fixes

deploy script

more deploy things

add healthcheck

add liveness probe to deployment

sleep before checking for artifact

deployability is configurable

also fix pr build script in new branch

rename constants and real_constants

variety of cleanups to get test-locally.sh working again

bugs

fix bug: GitHubPR has no .target

* try to spam the log less

* aggressively use short_str in log statements

* trailing semi-colons visually distinguish the end

* unify logging a bit

* make jobs logging nicer

* fix parse error

* do not duplicate messages

* forgot to include json

* few more logging fixups

* a little more context for PR and GHPR

* fix bug: eagerly update build status

If the PR is found during review update before a pull_request push
request is received, we will wait until the next heal to build. Instead,
we should eagerly build now (the later pull_request webhook will not build
because the shas are up to date).
]

* dont deploy non deployable refs also default to SHA_LENGTH

* few more logging cleanups

* move tests to another folder

* update deployment

* also copy in pr-deploy-script (#36)

* update deployment.yaml (#37)

* fix ham finger (#38)

* update deployment (#39)

* Fix NoImage constructor (#40)

* update deployment

* fix NoImage constructor

* update deployment for real (#41)

* more logging around dead prs (#42)

* more logging around dead prs

* update deployment.yaml

* filter by target repo (#44)

* update deployment (#45)

* useful shell functions (#46)

* Fix JSON Parsing & Target SHA Refresh (#49)

* fix json parsing in force endpoints

* fix target sha updating too

* update deployment

* relax build transition assertion (#50)

* relax build transition assertion

* update deployment

* update deployment

* fix target sha update to hail-ci-build-image (#51)

* Fix build state again (#53)

* ands and ors

* update deployment

* enable deployment of master (#55)

* Deploy Index Missing (#57)

* fix dockerfile to include deploy-index

* update deployment

* fix type error (#58)

* d is already json

* update deployment

* Redeploy takes a SHA not a Ref (#60)

* redeploy takes a sha

* update deployment

* Fix PR Dismissal (#61)

* fix pr dismissal

* update deployment

* fix type error (#62)

* fix type error

* update deployment

* disable deployment on master (#63)

* deploy better (#64)

* deploy better

* also print latest_deployed in status

* update deployment

* fix bug (#66)

* deploy master

* sometimes attributes is missing

* update deployment

* Update deployment.yaml

* update deployment

* Configurable deploying (#67)

* add set_deployable and fix bug in deploy_build_finished

* fix

* update deployment

* Eagerly merge reviewed prs (#52)

* eagerly heal if review change makes pr mergeable

* update deployment

* heal each repo once (#68)

* repos might be duplicated, so use a set

* update deployment

* only log when there is something to forget

* deploy master (#65)

* delete jobs after refreshing so we can run new ones (#69)

* delete jobs after refreshing so we can run new ones

* update deployment

* fix refresh (#70)

* only delete if completed or cancelled

* update deployment

* mount docker sock in deploy job (#71)

* clean up deletion logic (#73)

* fix artifacts location and merge syntax (#72)

* add static ip (#75)

* add static ip

* global addresses apparently dont work

* stop using git sha (#74)

* stop using git sha

* also deploy

* ci cloudtools (#76)

* Generalize deploy and fixes (#77)

* fix deletion logic

* generalize deploy secrets

* add cloudtools

* update deployment

* print the ref name when describing prs to build next (#78)

* fix big error (#79)

* update deployment (#80)

* add bgen-changes non-deploy, set 0.1 to deploy (#81)

* more logging (#84)

* more logging

* update deployment

* deploy cloudtools (#82)

* deploy cloudtools

* Update deployment.yaml

* capitalization on Nealelab (#86)

* update deployment (#87)

* heal watched refs, not just those with active PRs (#85)

In particular, try to deploy any watched ref

* fix image (#88)

* Fix and add test for review condensation (#89)

In particular, we incorrectly let COMMENTED overwrite an APPROVE
simply because it was more recent.

* added gcr-push secret for hail-is/batch (#93)

* added gcr-push secret for hail-is/batch

* make batch deployable

* update deployment (#95)

update Dockerfile to handle new batch file structure

* Quiet down gsutil copies at the end of build logs (#98)

* add index (#100)

* add CORS (#99)

- fix the Makefile
 - add CORS package dependency
 - add CORS to all endpoints

The CI does not serve secret information so accessing it from
arbitrary domains is not a problem.

* Generate temporary repos to test against (#97)

* wip

* a little closer

* bugs & posix

* fixes

* setup endpoints

* give notice of setup-endpoints.sh`

* wait more than one minute

when there are no k8s nodes available it takes about 60 seconds to start one

* bad rebase, should have deleted this line

* pretty sure ssl is not supported in ci2

* stop using ci-test repo and use better temp dir

* hide token in test-locally

* be professional

* address a few comments

* hide TOKEN

* move to subdir (#105)

* move to subdir

* last few things moved
daniel-goldstein referenced this pull request in daniel-goldstein/hail Feb 3, 2022
Import rest_authenticated_users_only to fix ci crash-loop.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants