Add experimental summarize() functions #5524

Merged: 3 commits, Mar 10, 2019

@tpoterba commented Mar 4, 2019

Experimental. Assigning Patrick because we had discussed this together several months ago.

@tpoterba commented Mar 4, 2019

Example:

In [4]: %time mt.summarize()
2019-03-04 11:21:51 Hail: INFO: Coerced sorted dataset
Columns
=======
* (Summary):
    Number of records : 2535

* [col] s:
          missing : 0 values (0.00%)
     minimum size : 7
     maximum size : 7
        mean size : 7.00
    sample values : ['HG00096', 'HG00097', 'HG00099', 'HG00100', 'HG00101']

Rows
====
* (Summary):
    Number of records : 24885

* [row] locus:
          missing : 0 values (0.00%)
    contig counts : {'22': 24885}

* [row] alleles:
         missing : 0 values (0.00%)
    minimum size : 2
    maximum size : 4
       mean size : 2.04

* [row] rsid:
          missing : 24885 values (100.00%)
     minimum size : None
     maximum size : None
        mean size : None
    sample values : []

* [row] qual:
    missing : 0 values (0.00%)
    minimum : 12.33
    maximum : 2908564.11
       mean : 36748.63
      stdev : 106749.39
        sum : 914489725.51

* [row] filters:
         missing : 24885 values (100.00%)
    minimum size : None
    maximum size : None
       mean size : None

* [row] info:
    missing : 0 values (0.00%)

* [row] info / AC:
         missing : 0 values (0.00%)
    minimum size : 1
    maximum size : 3
       mean size : 1.04

* [row] info / AF:
         missing : 0 values (0.00%)
    minimum size : 1
    maximum size : 3
       mean size : 1.04

* [row] info / AN:
    missing : 0 values (0.00%)
    minimum : 94
    maximum : 5070
       mean : 4622.33
      stdev : 860.97
        sum : 115026680.00

* [row] info / BaseQRankSum:
    missing : 14 values (0.06%)
    minimum : -111.17
    maximum : 110.10
       mean : -1.38
      stdev : 12.27
        sum : -34315.74

* [row] info / ClippingRankSum:
    missing : 14 values (0.06%)
    minimum : -59.26
    maximum : 80.14
       mean : -1.95
      stdev : 7.07
        sum : -48525.03

* [row] info / DP:
    missing : 0 values (0.00%)
    minimum : 56
    maximum : 65061
       mean : 14559.18
      stdev : 6793.99
        sum : 362305221.00

* [row] info / DS:
    missing : 0 values (0.00%)
     counts : {False: 24885}

* [row] info / FS:
    missing : 0 values (0.00%)
    minimum : 0.00
    maximum : 3200.00
       mean : 102.63
      stdev : 422.98
        sum : 2553917.46

* [row] info / HaplotypeScore:
    missing : 24885 values (100.00%)
    minimum : None
    maximum : None
       mean : None
      stdev : None
        sum : 0.00

* [row] info / InbreedingCoeff:
    missing : 0 values (0.00%)
    minimum : -0.93
    maximum : 0.90
       mean : 0.02
      stdev : 0.12
        sum : 577.00

* [row] info / MLEAC:
         missing : 0 values (0.00%)
    minimum size : 1
    maximum size : 3
       mean size : 1.04

* [row] info / MLEAF:
         missing : 0 values (0.00%)
    minimum size : 1
    maximum size : 3
       mean size : 1.04

* [row] info / MQ:
    missing : 0 values (0.00%)
    minimum : 20.04
    maximum : 60.71
       mean : 47.11
      stdev : 12.61
        sum : 1172337.70

* [row] info / MQ0:
    missing : 0 values (0.00%)
    minimum : 0
    maximum : 0
       mean : 0.00
      stdev : 0.00
        sum : 0.00

* [row] info / MQRankSum:
    missing : 14 values (0.06%)
    minimum : -149.84
    maximum : 64.51
       mean : -1.63
      stdev : 12.59
        sum : -40421.08

* [row] info / QD:
    missing : 0 values (0.00%)
    minimum : 0.00
    maximum : 39.45
       mean : 9.12
      stdev : 6.64
        sum : 227029.13

* [row] info / ReadPosRankSum:
    missing : 15 values (0.06%)
    minimum : -65.58
    maximum : 81.28
       mean : 0.72
      stdev : 5.18
        sum : 17878.76

* [row] info / set:
          missing : 24885 values (100.00%)
     minimum size : None
     maximum size : None
        mean size : None
    sample values : []

Entries
=======
* (Summary):
    Number of records : 63083475

* [entry] AD:
         missing : 6144940 values (9.74%)
    minimum size : 2
    maximum size : 4
       mean size : 2.04

* [entry] DP:
    missing : 6144940 values (9.74%)
    minimum : 0
    maximum : 158
       mean : 6.21
      stdev : 4.49
        sum : 353709503.00

* [entry] GQ:
    missing : 5570135 values (8.83%)
    minimum : 0
    maximum : 99
       mean : 21.67
      stdev : 18.67
        sum : 1246497536.00

* [entry] GT:
                 missing : 5570135 values (8.83%)
    Homozygous reference : 49511018
            Heterozygous : 4899765
      Homozygous variant : 3102557
                  ploidy : {2: 57513340}
                  phased : {False: 57513340}

* [entry] PL:
         missing : 5570135 values (8.83%)
    minimum size : 3
    maximum size : 10
       mean size : 3.15
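
To make the numeric blocks in the output above easier to read, here is a minimal plain-Python sketch of how such a summary could be computed and laid out. It is not the code from this PR: `summarize_numeric` and `print_summary` are hypothetical helper names, `None` stands in for a missing value, and the use of the population standard deviation is an assumption.

```python
import math


def summarize_numeric(values):
    """Summarize a list of numbers in which None marks a missing value."""
    present = [v for v in values if v is not None]
    n_total = len(values)
    n_missing = n_total - len(present)
    pct = 100.0 * n_missing / n_total if n_total else 0.0
    summary = {
        'missing': f'{n_missing} values ({pct:.2f}%)',
        'minimum': None,
        'maximum': None,
        'mean': None,
        'stdev': None,
        'sum': float(sum(present)),  # the sum over no values is 0.0
    }
    if present:
        mean = sum(present) / len(present)
        # Population standard deviation (divide by n); this is an assumption.
        variance = sum((v - mean) ** 2 for v in present) / len(present)
        summary.update(minimum=min(present), maximum=max(present),
                       mean=mean, stdev=math.sqrt(variance))
    return summary


def print_summary(name, summary):
    """Print the summary with right-aligned keys, as in the output above."""
    width = max(len(key) for key in summary)
    print(f'* {name}:')
    for key, value in summary.items():
        shown = f'{value:.2f}' if isinstance(value, float) else str(value)
        print(f'    {key.rjust(width)} : {shown}')


print_summary('[row] example', summarize_numeric([1, 2, 3, None]))
```

With an all-missing input such as `[None, None]`, every statistic except `sum` stays `None` and `sum` is `0.00`, which matches the shape of the `info / HaplotypeScore` block.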

@patrick-schultz left a comment

This is great! It will be fun to add some approximate summaries to this (quartiles, most frequent values, number of unique values...).
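
As a rough illustration of the summaries mentioned here, the sketch below computes quartiles, the most frequent values, and the number of unique values for a small in-memory list. It is exact rather than approximate and uses only the Python standard library. `extra_summaries` is a hypothetical name and nothing here is part of this PR; a distributed version would presumably need approximate algorithms.

```python
import statistics
from collections import Counter


def extra_summaries(values, top_k=3):
    """Exact quartiles, most frequent values, and unique count (None = missing)."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    # statistics.quantiles requires at least two data points.
    quartiles = statistics.quantiles(present, n=4) if len(present) >= 2 else None
    return {
        'quartiles': quartiles,
        'most frequent': counts.most_common(top_k),
        'unique values': len(counts),
    }


result = extra_summaries([1, 2, 2, 3, 3, 3, 4, None], top_k=2)
for key, value in result.items():
    print(f'{key} : {value}')
```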

d['maximum'] = lambda results: format(results[stats]['max'])
d['mean'] = lambda results: format(results[stats]['mean'])
d['stdev'] = lambda results: format(results[stats]['stdev'])
d['sum'] = lambda results: format(results[stats]['sum'])

@patrick-schultz commented Mar 5, 2019

Should the sum be displayed as an integer when the field type is an integer?

@tpoterba commented Mar 8, 2019

good call.
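
One way to act on this suggestion, sketched under assumptions: pass a flag saying whether the field is integer-typed and skip the decimal places in that case. In the example output, `info / AN` has integer values but its sum prints as `115026680.00`. `format_stat` and its `integral` parameter are hypothetical; this is not necessarily the change that was merged.

```python
def format_stat(value, integral=False):
    """Format a statistic for display.

    `integral` is a hypothetical flag stating that the field has an
    integer type, so the value should be shown without decimals.
    """
    if value is None:
        return 'None'
    if integral:
        return str(int(value))
    return f'{value:.2f}'


print(format_stat(115026680.0, integral=True))  # integer-typed field
print(format_stat(914489725.51))                # float-typed field
```

Only exact statistics such as the sum, minimum, and maximum would take the integer path; the mean and standard deviation of an integer field are still fractional in general.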

@@ -2896,6 +2896,22 @@ def distinct(self) -> 'Table':

        return Table(TableDistinct(self._tir))

    def summarize(self):
        """Compute and print summary information about the fields in the matrix table.

@patrick-schultz commented Mar 5, 2019

This should say "table", not "matrix table".

@tpoterba commented Mar 8, 2019

Addressed comments, added a test, and added Expression.summarize().

@danking merged commit 9b15022 into hail-is:master on Mar 10, 2019
1 check passed