Add experimental summarize() functions #5524

Merged
merged 3 commits into hail-is:master on Mar 10, 2019

@tpoterba
Collaborator

commented Mar 4, 2019

Experimental. Assigning Patrick because we had discussed this together several months ago.

@tpoterba

Collaborator Author

commented Mar 4, 2019

example:

In [4]: %time mt.summarize()
2019-03-04 11:21:51 Hail: INFO: Coerced sorted dataset
Columns
=======
* (Summary):
    Number of records : 2535

* [col] s:
          missing : 0 values (0.00%)
     minimum size : 7
     maximum size : 7
        mean size : 7.00
    sample values : ['HG00096', 'HG00097', 'HG00099', 'HG00100', 'HG00101']

Rows
====
* (Summary):
    Number of records : 24885

* [row] locus:
          missing : 0 values (0.00%)
    contig counts : {'22': 24885}

* [row] alleles:
         missing : 0 values (0.00%)
    minimum size : 2
    maximum size : 4
       mean size : 2.04

* [row] rsid:
          missing : 24885 values (100.00%)
     minimum size : None
     maximum size : None
        mean size : None
    sample values : []

* [row] qual:
    missing : 0 values (0.00%)
    minimum : 12.33
    maximum : 2908564.11
       mean : 36748.63
      stdev : 106749.39
        sum : 914489725.51

* [row] filters:
         missing : 24885 values (100.00%)
    minimum size : None
    maximum size : None
       mean size : None

* [row] info:
    missing : 0 values (0.00%)

* [row] info / AC:
         missing : 0 values (0.00%)
    minimum size : 1
    maximum size : 3
       mean size : 1.04

* [row] info / AF:
         missing : 0 values (0.00%)
    minimum size : 1
    maximum size : 3
       mean size : 1.04

* [row] info / AN:
    missing : 0 values (0.00%)
    minimum : 94
    maximum : 5070
       mean : 4622.33
      stdev : 860.97
        sum : 115026680.00

* [row] info / BaseQRankSum:
    missing : 14 values (0.06%)
    minimum : -111.17
    maximum : 110.10
       mean : -1.38
      stdev : 12.27
        sum : -34315.74

* [row] info / ClippingRankSum:
    missing : 14 values (0.06%)
    minimum : -59.26
    maximum : 80.14
       mean : -1.95
      stdev : 7.07
        sum : -48525.03

* [row] info / DP:
    missing : 0 values (0.00%)
    minimum : 56
    maximum : 65061
       mean : 14559.18
      stdev : 6793.99
        sum : 362305221.00

* [row] info / DS:
    missing : 0 values (0.00%)
     counts : {False: 24885}

* [row] info / FS:
    missing : 0 values (0.00%)
    minimum : 0.00
    maximum : 3200.00
       mean : 102.63
      stdev : 422.98
        sum : 2553917.46

* [row] info / HaplotypeScore:
    missing : 24885 values (100.00%)
    minimum : None
    maximum : None
       mean : None
      stdev : None
        sum : 0.00

* [row] info / InbreedingCoeff:
    missing : 0 values (0.00%)
    minimum : -0.93
    maximum : 0.90
       mean : 0.02
      stdev : 0.12
        sum : 577.00

* [row] info / MLEAC:
         missing : 0 values (0.00%)
    minimum size : 1
    maximum size : 3
       mean size : 1.04

* [row] info / MLEAF:
         missing : 0 values (0.00%)
    minimum size : 1
    maximum size : 3
       mean size : 1.04

* [row] info / MQ:
    missing : 0 values (0.00%)
    minimum : 20.04
    maximum : 60.71
       mean : 47.11
      stdev : 12.61
        sum : 1172337.70

* [row] info / MQ0:
    missing : 0 values (0.00%)
    minimum : 0
    maximum : 0
       mean : 0.00
      stdev : 0.00
        sum : 0.00

* [row] info / MQRankSum:
    missing : 14 values (0.06%)
    minimum : -149.84
    maximum : 64.51
       mean : -1.63
      stdev : 12.59
        sum : -40421.08

* [row] info / QD:
    missing : 0 values (0.00%)
    minimum : 0.00
    maximum : 39.45
       mean : 9.12
      stdev : 6.64
        sum : 227029.13

* [row] info / ReadPosRankSum:
    missing : 15 values (0.06%)
    minimum : -65.58
    maximum : 81.28
       mean : 0.72
      stdev : 5.18
        sum : 17878.76

* [row] info / set:
          missing : 24885 values (100.00%)
     minimum size : None
     maximum size : None
        mean size : None
    sample values : []

Entries
=======
* (Summary):
    Number of records : 63083475

* [entry] AD:
         missing : 6144940 values (9.74%)
    minimum size : 2
    maximum size : 4
       mean size : 2.04

* [entry] DP:
    missing : 6144940 values (9.74%)
    minimum : 0
    maximum : 158
       mean : 6.21
      stdev : 4.49
        sum : 353709503.00

* [entry] GQ:
    missing : 5570135 values (8.83%)
    minimum : 0
    maximum : 99
       mean : 21.67
      stdev : 18.67
        sum : 1246497536.00

* [entry] GT:
                 missing : 5570135 values (8.83%)
    Homozygous reference : 49511018
            Heterozygous : 4899765
      Homozygous variant : 3102557
                  ploidy : {2: 57513340}
                  phased : {False: 57513340}

* [entry] PL:
         missing : 5570135 values (8.83%)
    minimum size : 3
    maximum size : 10
       mean size : 3.15
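
The numeric blocks in the output above (missing, minimum, maximum, mean, stdev, sum) can be reproduced in plain Python. This is a minimal sketch and not Hail's implementation: `summarize_numeric` and `render` are hypothetical helper names, `None` stands in for a missing value, and the use of the population standard deviation is an assumption.

```python
import math
from typing import Optional, Sequence


def summarize_numeric(values: Sequence[Optional[float]]) -> dict:
    """Summarize one numeric field, treating None as a missing value."""
    present = [v for v in values if v is not None]
    n_total = len(values)
    n_missing = n_total - len(present)
    summary = {
        'missing': n_missing,
        'missing_pct': 100.0 * n_missing / n_total if n_total else 0.0,
        'minimum': None,
        'maximum': None,
        'mean': None,
        'stdev': None,
        # An all-missing field still reports a sum of 0.00, as in the output above.
        'sum': float(sum(present)),
    }
    if present:
        mean = sum(present) / len(present)
        # Population variance (divide by n); this choice is an assumption.
        variance = sum((v - mean) ** 2 for v in present) / len(present)
        summary.update(
            minimum=min(present),
            maximum=max(present),
            mean=mean,
            stdev=math.sqrt(variance),
        )
    return summary


def render(name: str, summary: dict) -> str:
    """Format a summary with right-aligned labels, similar to the layout above."""
    def fmt(x):
        if x is None:
            return 'None'
        return str(x) if isinstance(x, int) else f'{x:.2f}'

    rows = [
        ('missing', f"{summary['missing']} values ({summary['missing_pct']:.2f}%)"),
        ('minimum', fmt(summary['minimum'])),
        ('maximum', fmt(summary['maximum'])),
        ('mean', fmt(summary['mean'])),
        ('stdev', fmt(summary['stdev'])),
        ('sum', fmt(summary['sum'])),
    ]
    width = max(len(label) for label, _ in rows)
    lines = [f'* {name}:']
    lines += [f'    {label.rjust(width)} : {value}' for label, value in rows]
    return '\n'.join(lines)


if __name__ == '__main__':
    print(render('[row] example', summarize_numeric([1.0, 2.0, 3.0, None])))
```
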
@patrick-schultz
Collaborator

left a comment

This is great! It will be fun to add some approximate summaries to this (quartiles, most frequent values, number of unique values...).
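
Two of the approximate summaries mentioned here could be prototyped in a single pass over the data. The sketch below is illustrative only and is not part of this PR: it uses reservoir sampling to estimate quartiles and the Misra-Gries algorithm to find candidate most-frequent values. The function names and default sizes are hypothetical, and counting unique values is not covered.

```python
import random
import statistics
from typing import Dict, Hashable, Iterable, List


def reservoir_sample(stream: Iterable[float], size: int, seed: int = 0) -> List[float]:
    """Keep a uniform random sample of at most `size` items from a stream."""
    rng = random.Random(seed)
    sample: List[float] = []
    for i, item in enumerate(stream):
        if i < size:
            sample.append(item)
        else:
            # Replace a random slot with probability size / (i + 1).
            j = rng.randint(0, i)
            if j < size:
                sample[j] = item
    return sample


def approx_quartiles(stream: Iterable[float], size: int = 1000, seed: int = 0) -> List[float]:
    """Estimate the three quartile cut points from a reservoir sample."""
    sample = reservoir_sample(stream, size, seed)
    return statistics.quantiles(sample, n=4)


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Return candidate heavy hitters using at most k - 1 counters.

    Any value occurring more than n / k times in a stream of length n is
    guaranteed to be among the returned keys; the counts are underestimates.
    """
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == '__main__':
    print(approx_quartiles(range(10000), size=500))
    print(misra_gries(['22'] * 50 + ['X'] * 30 + [str(i) for i in range(20)], k=3))
```

Both structures use memory bounded by `size` or `k` rather than by the number of records, which is what would make them plausible to compute alongside the exact statistics.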

d['maximum'] = lambda results: format(results[stats]['max'])
d['mean'] = lambda results: format(results[stats]['mean'])
d['stdev'] = lambda results: format(results[stats]['stdev'])
d['sum'] = lambda results: format(results[stats]['sum'])

patrick-schultz Mar 5, 2019

Collaborator

Should sum be displayed as an integer when the field type is an integer?

tpoterba Mar 8, 2019

Author Collaborator

good call.
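
One way to act on this suggestion is to pick the formatter from the field type. The following is a minimal sketch under stated assumptions: `format_numeric_stats` and `is_integer_field` are hypothetical names, the `'min'` key is assumed by analogy with the `'max'`, `'mean'`, `'stdev'` and `'sum'` keys in the snippet under review, and the change actually made in the PR is not shown in this conversation.

```python
from typing import Optional


def format_numeric_stats(stats: dict, is_integer_field: bool) -> dict:
    """Format aggregated statistics for display.

    For integer fields, minimum, maximum and sum are shown as integers;
    mean and stdev are always shown with two decimal places.
    """
    def as_float(x: Optional[float]) -> str:
        return 'None' if x is None else f'{x:.2f}'

    def as_int(x: Optional[float]) -> str:
        return 'None' if x is None else str(int(x))

    # Statistics that stay integral when the underlying field is integral.
    exact = as_int if is_integer_field else as_float
    return {
        'minimum': exact(stats['min']),
        'maximum': exact(stats['max']),
        'mean': as_float(stats['mean']),
        'stdev': as_float(stats['stdev']),
        'sum': exact(stats['sum']),
    }


if __name__ == '__main__':
    dp = {'min': 0, 'max': 158, 'mean': 6.21, 'stdev': 4.49, 'sum': 353709503.0}
    print(format_numeric_stats(dp, is_integer_field=True))
```

With this approach an integer field such as the entry-level DP above would show its sum as 353709503 rather than 353709503.00.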

@@ -2896,6 +2896,22 @@ def distinct(self) -> 'Table':

        return Table(TableDistinct(self._tir))

    def summarize(self):
        """Compute and print summary information about the fields in the matrix table.

patrick-schultz Mar 5, 2019

Collaborator

This should say "table", not "matrix table".

@tpoterba

Collaborator Author

commented Mar 8, 2019

address comments, added test, added Expression.summarize()

fix

@danking danking merged commit 9b15022 into hail-is:master Mar 10, 2019

1 check passed

hail-ci-0-1 successful build