Figure 1C? [2C?] #1831

Closed
ValWood opened this issue Jan 13, 2018 · 59 comments

Comments

@ValWood
Member

ValWood commented Jan 13, 2018

Mock up.
I will update the pombe and the cerevisiae data.
Antonia will prepare human data and the figure

[Image: slimming summary]

@ValWood
Member Author

ValWood commented Jan 13, 2018

@Antonialock this is what I did; you will need to follow a similar procedure for human.

I started with the slim set here: https://curation.pombase.org/pombase-trac/wiki/GOslims
see the lists 1) standard slim with swaps
and 2) added for greater coverage for "unknowns" project

See the list below

cerevisiae
Results from slimming
unslimmed but annotated genes (242)
I then checked whether this list contains any well-characterised genes that we missed.

I figured that, if the SGD curators had manually annotated the BP root node with ND, any mappings from other sources would largely be to fairly high-level terms.

I took the list of genes with a manual ND annotation to the BP root node,
ran it through the enrichment tool to double check,
and subtracted it from the 'unslimmed' list.

This gave me a smaller list to evaluate (119).
I ran enrichment on this list (P=1 to see all annotated terms), then scanned the results to identify any terms that are not:
i) Function in process
ii) response to…
iii) high level (cellular process etc)

These terms have fairly specific annotations, so I will add them to the list:
GO:0072659 protein localization to plasma membrane
GO:0019413 acetate biosynthetic process
GO:0009436 glyoxylate catabolic process
GO:0034079 butanediol biosynthetic process (energy generation)
GO:1901426 response to furfural (these are really detoxification)
GO:0018890 cyanamide metabolic process (really cellular detoxification)
GO:0006276 plasmid maintenance
GO:2000001 regulation of DNA damage checkpoint
GO:0009636 response to toxic substance YNR064C, YMR074C, YOL052C-A, YHL010C (really detoxification)
GO:0071218 cellular response to misfolded protein

double checked, all these are in
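
The subtraction step described above amounts to a set difference between two gene lists. A minimal sketch, assuming both lists are available as plain identifiers; the function name and the placeholder ID `GENE_X` are hypothetical, and the real lists came from the slimming and enrichment tools:

```python
def genes_to_review(unslimmed, nd_root_annotated):
    """Return unslimmed genes that are NOT annotated to the BP root with ND.

    Both arguments are iterables of gene identifiers. The result is sorted
    so that the review list is stable between runs.
    """
    return sorted(set(unslimmed) - set(nd_root_annotated))


# Illustrative input only: four systematic names quoted in this thread plus
# one placeholder standing in for a gene with an ND root annotation.
unslimmed = ["YNR064C", "YMR074C", "YOL052C-A", "YHL010C", "GENE_X"]
nd_root = ["GENE_X"]

to_review = genes_to_review(unslimmed, nd_root)
print(to_review)
```

Only the genes that remain after the subtraction need to be inspected by hand.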

@ValWood
Member Author

ValWood commented Jan 13, 2018

SGD total 5915 slimmed 4900 (~83%) unslimmed 794+221 (1015)
PomBase total 5070 slimmed 4336 (~85.5%) unslimmed 734+10 (744)

Note, it is slightly different from
https://www.pombase.org/browse-curation/fission-yeast-go-slim-terms
Protein coding genes not covered by the slim (750 in total):
Gene products with biological process annotation, but not in any of the categories above: 27
Gene products with no biological process annotation: 723
because the terms are slightly more general
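
The percentages quoted above can be reproduced with a small sanity check. This is a sketch only: the function name is hypothetical, and the counts are copied from this comment, using the parenthesised unslimmed totals. It also reports whether slimmed plus unslimmed adds up to the stated total.

```python
def slim_coverage(total, slimmed, unslimmed):
    """Return (percent slimmed, discrepancy) for one organism.

    discrepancy is total - (slimmed + unslimmed); a non-zero value means
    the three counts do not add up and are worth rechecking.
    """
    percent = round(100.0 * slimmed / total, 1)
    return percent, total - (slimmed + unslimmed)


# (total, slimmed, unslimmed) as quoted in the comment above
counts = {
    "SGD": (5915, 4900, 1015),
    "PomBase": (5070, 4336, 744),
}

results = {name: slim_coverage(*values) for name, values in counts.items()}
for name, (percent, gap) in results.items():
    print(f"{name}: {percent}% slimmed, discrepancy {gap}")
```

With the figures as quoted, the SGD counts add up exactly, while the PomBase counts differ from the stated total by 10, so one of those three numbers may need rechecking.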

@ValWood
Member Author

ValWood commented Jan 13, 2018

I will rerun pombe and cerevisiae tomorrow.
Antonia can you

  • do human on your next working day and
  • make a new figure with the 3 datasets.

@ValWood
Member Author

ValWood commented Jan 13, 2018

  • I'll write a short "method" for legend

@ValWood ValWood changed the title Figure 2 A Figure 1C? Jan 13, 2018
@ValWood ValWood changed the title Figure 1C? Figure 1C? [2C?] Jan 13, 2018
@ValWood
Member Author

ValWood commented Feb 11, 2018

@Antonialock you mentioned that I hadn't done the instructions, but they are above.
Can you do the bit for human (with the additional terms we discussed; let me know if anything isn't clear)? I'm rechecking pombe and cerevisiae now...

@ValWood
Member Author

ValWood commented Feb 11, 2018

This is the current list from
https://curation.pombase.org/pombase-trac/wiki/GOslims
after discounting all of the uninformative terms, and checking by enrichment that nothing known is missed.

GO:0140053
GO:0000278
GO:0006810
GO:0007010
GO:0006412
GO:0007031
GO:0030437
GO:0023052
GO:0006520
GO:0032200
GO:0016074
GO:0005975
GO:0070647
GO:0007059
GO:0030163
GO:0055086
GO:0006351
GO:0006260
GO:0071554
GO:1901990
GO:0140013
GO:0006461
GO:0071941
GO:0006355
GO:0006399
GO:0042254
GO:0006457
GO:0006486
GO:0016071
GO:0007005
GO:0006310
GO:1901135
GO:0000747
GO:0006913
GO:0006091
GO:0006914
GO:0098754
GO:0016192
GO:0051186
GO:0007163
GO:0061024
GO:0006629
GO:0006281
GO:0000910
GO:0051604
GO:0007155
GO:0055085
GO:0006766
GO:0006325
GO:0016073
GO:0006915
GO:0006790
GO:0055065
GO:0140056
GO:0000920
GO:0000493
GO:0070941
GO:0007124
GO:0009305
GO:0018342
GO:0000128
GO:0034389
GO:0034276
GO:0007032
GO:0030091
GO:0018345
GO:0006797
GO:0006089
GO:0072659
GO:0019413
GO:0009436
GO:0034079
GO:1901426
GO:0018890
GO:0006276
GO:2000001
GO:0009636
GO:2000001
GO:0071218
GO:0046210
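
A pasted ID list like the one above is easy to check mechanically before it goes into a slimming tool. The sketch below uses a hypothetical helper, not part of any GO tool. It reports repeated and malformed IDs while preserving the original order, and is run here on the last six IDs of the list:

```python
import re
from collections import Counter

# A GO identifier is "GO:" followed by exactly seven digits.
GO_ID = re.compile(r"GO:\d{7}")


def check_slim_ids(ids):
    """Return (unique_ids_in_order, duplicated_ids, malformed_ids)."""
    counts = Counter(ids)
    unique = list(dict.fromkeys(ids))  # de-duplicate, keep first-seen order
    duplicates = [i for i in unique if counts[i] > 1]
    malformed = [i for i in unique if not GO_ID.fullmatch(i)]
    return unique, duplicates, malformed


tail = [
    "GO:0006276", "GO:2000001", "GO:0009636",
    "GO:2000001", "GO:0071218", "GO:0046210",
]
unique, duplicates, malformed = check_slim_ids(tail)
print("duplicates:", duplicates)
print("malformed:", malformed)
```

A repeated ID is harmless for slimming itself, but it makes the list length misleading when two versions of a slim are compared by eye.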

@Antonialock
Member

What slimming tool are you using? I keep getting an error message from http://go.princeton.edu/cgi-bin/GOTermMapper

Maybe I'm doing something wrong?
I input the primary gene names for protein coding genes reported by HGNC (downloaded here: https://www.genenames.org/cgi-bin/statistics )

I enter the slim terms (above + multicellular-specific terms)

I use the GOA_human_GAF downloaded from here: http://geneontology.org/page/download-go-annotations

@ValWood
Member Author

ValWood commented Feb 16, 2018

I don't think this will work because the file has UniProt IDs...
It also has 29082 lines, which is quite a lot more than the number of human genes (that's why you are using HGNC IDs; they should be a 1:1 list).

Therefore you will need to select the data option for goa_human_hgnc (this will recognise the HGNC IDs). This will seem like you are using the hgnc slim, but you aren't, because you override that in the advanced options. It's very confusing....

This will then use the current contents of the GO database mapped to HGNC ID set....
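
A GAF file has one line per annotation, not one per gene, so its line count will normally exceed the number of genes. A minimal sketch for comparing the two numbers, assuming the standard tab-separated GAF layout (column 1 = database, column 2 = object ID, header lines starting with `!`). The sample rows use made-up identifiers, not real GOA data:

```python
import io


def count_gaf_objects(handle):
    """Return (annotation line count, set of distinct (DB, object ID) pairs)."""
    lines = 0
    objects = set()
    for line in handle:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF header/comment lines
        cols = line.rstrip("\n").split("\t")
        lines += 1
        objects.add((cols[0], cols[1]))
    return lines, objects


# Hypothetical three-annotation sample covering two gene products.
sample = io.StringIO(
    "!gaf-version: 2.1\n"
    "DB_X\tACC_A\tGENE_A\t\tGO:0006412\tREF:1\tIEA\t\tP\n"
    "DB_X\tACC_A\tGENE_A\t\tGO:0006457\tREF:2\tIBA\t\tP\n"
    "DB_X\tACC_B\tGENE_B\t\tGO:0006281\tREF:3\tIDA\t\tP\n"
)
n_lines, objects = count_gaf_objects(sample)
print(n_lines, "annotation lines for", len(objects), "gene products")
```

For a real file, pass an open file handle instead of the `io.StringIO` sample.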

@Antonialock
Member

It looks like it ignores IEA and IBA annotations e.g. this gene doesn't slim
https://www.ncbi.nlm.nih.gov/gene/127550

is that as expected?

@ValWood
Member Author

ValWood commented Feb 16, 2018

you can select the evidence codes included, are they all selected?
(it includes IEA when I use it?)

@ValWood
Member Author

ValWood commented Feb 16, 2018

I can't see that human gene in the GO database...that's probably why. I didn't say this would be straightforward... you need to contact GO helpdesk for that one...

@ValWood
Member Author

ValWood commented Feb 16, 2018

actually you can't select evidence for the slimmer, I'm thinking of the enrichment tool.

It's probably because the slimmer tool isn't aware of IBA. Do you have an example of a missing IEA? (This gene only seems to have IBA.)

If so, you will need to mail gotools and tell them to include IBA and any other codes....
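
One way to see which evidence codes a gene actually carries in a given annotation file is to tabulate them directly. The sketch below assumes the standard GAF column order (column 3 = symbol, column 7 = evidence code, column 9 = aspect). The helper names and sample rows are hypothetical, and it says nothing about what the slimmer or the GO database do internally:

```python
import io
from collections import defaultdict


def evidence_by_symbol(handle, aspect="P"):
    """Map gene symbol -> set of evidence codes for one aspect (P = biological process)."""
    codes = defaultdict(set)
    for line in handle:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 9 and cols[8] == aspect:
            codes[cols[2]].add(cols[6])
    return dict(codes)


def lacking(codes, wanted):
    """Return the symbols that have no annotation with the wanted evidence code."""
    return sorted(s for s, seen in codes.items() if wanted not in seen)


# Made-up rows: GENE_A has only IBA, GENE_B has both IEA and IBA.
sample = io.StringIO(
    "!gaf-version: 2.1\n"
    "DB_X\tACC_A\tGENE_A\t\tGO:0005975\tREF:1\tIBA\t\tP\n"
    "DB_X\tACC_B\tGENE_B\t\tGO:0005975\tREF:2\tIEA\t\tP\n"
    "DB_X\tACC_B\tGENE_B\t\tGO:0006629\tREF:3\tIBA\t\tP\n"
)
codes = evidence_by_symbol(sample)
print(lacking(codes, "IEA"))
```

Genes that appear in the output would drop out of any tool that ignores the evidence codes they do have.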

@Antonialock
Member

It has
glycosphingolipid biosynthetic process | IEA
carbohydrate metabolic process | IEA
?

@Antonialock
Member

at least that's what's shown on the entrez gene page
[Image: screen shot 2018-02-16 at 13 25 43]

@Antonialock
Member

oh I see, in amigo it only has IBA. So why does entrez show IEAs?
argh, so confusing

@Antonialock
Member

@ValWood
Member Author

ValWood commented Feb 16, 2018

Mail gotools and check which evidence codes are included (they will probably get back to you today).
Mail GO and ask why human IEAs are not in the GO database.
welcome to my world....

@ValWood
Member Author

ValWood commented Feb 26, 2018

@Antonialock an alternative is to try the QuickGO slimmer.
It will work with the ID set (the reason I never use it for pombe is that we don't use UniProt IDs for GO). It will only be possible if it provides a list of "unslimmed genes".

I'm pretty sure from memory that it does because Jane and I used this when we were building the generic slim.

@Antonialock
Member

Well unfortunately the QuickGO slimming tool is broken. I sent them a message

"Hi. I'm trying to use the slimming tool but am having multiple problems
https://www.ebi.ac.uk/QuickGO/slimming

I uploaded my own set of BP terms to use as the slimming set.
I then wanted to slim using my own list of uniprot IDs, but got an error message saying I need to limit my own set of gene IDs to 500
I then tried to filter on the "human reference set" but got this error message:
"failed to fetch REST response due to:
org.springframework.web.client.HttpClientErrorException: 400 bad request""

@Antonialock
Member

Note to self

The number of human genes that we want to include is 19674

The list can be retrieved using this search:
NOT existence:uncertain AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]" AND proteome:up000005640

Removing the "existence:uncertain" entries drops the number of genes from 20245 to 19674.

@Antonialock
Member

If you get me the cerevisiae numbers I can plug them in.

@Antonialock
Member

see comment within your comment for clarification of the human annotation numbers

@Antonialock
Member

Antonialock commented Mar 19, 2018

[Image: known vs unknown]

@Antonialock
Member

What are the 31 missing in cerevisiae, @ValWood? For now I rounded "known" so that the total comes to 100.

@ValWood
Member Author

ValWood commented Mar 19, 2018

Looks brilliant! I will check the pombe and cerevisiae numbers.

@ValWood
Member Author

ValWood commented Mar 20, 2018

what are the 31 missing in cerevisiae

the most recent numbers above were:

SGD total 5915 slimmed 4900 (~83%) unslimmed 794+221 (1015)
PomBase total 5070 slimmed 4336 (~85.5%) unslimmed 734+10 (744)

I will check them using your final slim so we use the same slim for everything.
Can you send me just the IDs as a list?

@Antonialock
Member

here
slim list used for human.txt

@ValWood
Member Author

ValWood commented Mar 27, 2018

Did you definitely use my slim with the added terms? I'm sure some things that slimmed before are now not slimming.

@ValWood
Member Author

ValWood commented Mar 27, 2018

So I need my list + your additions for human?

@Antonialock
Member

I used the list on this page as a base https://curation.pombase.org/pombase-trac/wiki/GOslims
e.g.

GO:0140053 GO:0000278 GO:0006810 GO:0007010 GO:0006412 GO:0007031 GO:0030437 GO:0023052 GO:0006520 GO:0032200 GO:0016074 GO:0005975 GO:0070647 GO:0007059 GO:0030163 GO:0055086 GO:0006351 GO:0006260 GO:0071554 GO:1901990 GO:0140013 GO:0065003 GO:0071941 GO:0006355 GO:0006399 GO:0042254 GO:0006457 GO:0006486 GO:0016071 GO:0007005 GO:0006310 GO:1901135 GO:0000747 GO:0006913 GO:0006091 GO:0006914 GO:0098754 GO:0016192 GO:0051186 GO:0007163 GO:0061024 GO:0006629 GO:0006281 GO:0000910 GO:0051604 GO:0007155 GO:0055085 GO:0006766 GO:0006325 GO:0016073 GO:0006915 GO:0006790 GO:0055065 GO:0140056

@Antonialock
Member

my slim list is shown above (posted 9 days ago)

@ValWood
Member Author

ValWood commented Mar 27, 2018

But it excludes some of the terms in my extended slim.

Can you just send me your "additional" terms (otherwise I need to compare them one by one)?

(I want to report only a single slim in the paper, so I need to just add the additional terms you used to my extended slim...just to ensure that nothing looks odd.)

@ValWood
Member Author

ValWood commented Mar 27, 2018

I used the list above, and some terms I used were missing. Sorry this is getting confusing...just send me the list of terms you added to my original list....

@Antonialock
Member

Antonialock commented Mar 27, 2018

GO:0022414
GO:0032501
GO:0032502
GO:0002376
GO:0140053
GO:0000278
GO:0006810
GO:0007010
GO:0006412
GO:0007031
GO:0023052
GO:0006520
GO:0032200
GO:0016074
GO:0005975
GO:0070647
GO:0007059
GO:0030163
GO:0055086
GO:0006351
GO:0006260
GO:1901990
GO:0140013
GO:0065003
GO:0071941
GO:0006355
GO:0006399
GO:0042254
GO:0006457
GO:0006486
GO:0016071
GO:0007005
GO:0006310
GO:1901135
GO:0006913
GO:0006091
GO:0006914
GO:0098754
GO:0016192
GO:0051186
GO:0007163
GO:0061024
GO:0006629
GO:0006281
GO:0000910
GO:0051604
GO:0007155
GO:0055085
GO:0006766
GO:0006325
GO:0016073
GO:0006915
GO:0006790
GO:0055065
GO:0140056
GO:0000920
GO:0000493
GO:0070941
GO:0009305
GO:0018342
GO:0034389
GO:0034276
GO:0007032
GO:0030091
GO:0018345
GO:0006797
GO:0006089
GO:0072659
GO:0019413
GO:0009436
GO:0034079
GO:2000001
GO:0046210
GO:0008215
GO:0060285
GO:1902224
GO:0009447
GO:0044782
GO:0098542
GO:0034329
GO:0050808
GO:0042060
GO:0045329
GO:0019285
GO:0006069
GO:0032963
GO:0030198
GO:0007030
GO:0007040
GO:0032438
GO:0034067
GO:0045454
GO:0097176
GO:0042423
GO:0031648
GO:0007018
GO:0003341
GO:0032418
GO:0030261
GO:0097120
GO:0006954

@Antonialock
Member

that is the exact list I was using

@Antonialock
Member

I took your list, and added to it

@Antonialock
Member

and removed terms with zero annotations, e.g. flocculation and, I guess, some spore term.

@Antonialock
Member

but if you take your exact list (which I thought I was using? but maybe not) and subtract mine, you'll see the difference?
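
Subtracting one slim from the other, in both directions, gives the "added" and "removed" terms without comparing the lists one by one. A minimal sketch with a hypothetical helper; the four-ID excerpts are taken from the two lists posted earlier in this thread, not the full slims:

```python
def compare_slims(base, extended):
    """Return (added, removed): terms only in `extended`, terms only in `base`.

    Order follows each input list, and repeated IDs are reported once.
    """
    base_set, extended_set = set(base), set(extended)
    added = [t for t in dict.fromkeys(extended) if t not in base_set]
    removed = [t for t in dict.fromkeys(base) if t not in extended_set]
    return added, removed


# Short excerpts only; paste the complete lists to get the real difference.
base = ["GO:0140053", "GO:0000278", "GO:0030437", "GO:0007124"]
extended = ["GO:0022414", "GO:0032501", "GO:0140053", "GO:0000278"]

added, removed = compare_slims(base, extended)
print("added:  ", added)
print("removed:", removed)
```

Reporting both directions matters here, because the human slim was built by adding terms and also by dropping terms that had no annotations.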

@ValWood
Member Author

ValWood commented Mar 28, 2018

I wanted to use your list, but when I used it some things weren't slimming for cerevisiae and pombe. I know I needed to add some back (cell wall stuff, flocculation etc.), but I wasn't sure exactly which ones you removed.....

@Antonialock
Member

slim20180328.txt

@ValWood
Member Author

ValWood commented Mar 28, 2018

I'm confused. I used the slim terms you provided, then when double checking human the following don't slim?

keratinization 163
skin development 169
tissue development 253
immunoglobulin production 45
spermatogenesis 91
C-terminal protein lipidation 24
cilium movement 21
fertilization 34

@ValWood
Member Author

ValWood commented Mar 28, 2018

please just send me the list of IDs that you added to my list.....that is all I need, nothing else.

@ValWood
Member Author

ValWood commented Mar 28, 2018

In a plain text file....no control characters or anything.....

@Antonialock
Member

see the text file above? It's plain text Val. I don't understand what your problem is. Just copy and paste the IDs into the tool.

GO:0032502 developmental process is in my list so that catches e.g. keratinization
GO:0003341 cilium movement is also in my list..
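
The point that a high-level slim term "catches" a more specific term can be illustrated with a toy ancestor lookup. This is a sketch of the general idea only. The parent links below are a deliberately simplified, hypothetical hierarchy using term names, not the real GO graph, and the function names are made up:

```python
def ancestors(term, parents):
    """Return the term itself plus every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return seen


def slim_terms_for(term, parents, slim):
    """Return the slim terms that the given term maps up to (empty set = unslimmed)."""
    return ancestors(term, parents) & set(slim)


# Hypothetical, heavily simplified hierarchy for illustration.
parents = {
    "keratinization": ["intermediate term (hypothetical)"],
    "intermediate term (hypothetical)": ["developmental process"],
    "cilium movement": [],
    "unplaced term (hypothetical)": ["biological_process"],
}
slim = {"developmental process", "cilium movement"}

for term in ("keratinization", "cilium movement", "unplaced term (hypothetical)"):
    print(term, "->", sorted(slim_terms_for(term, parents, slim)))
```

A term slims if it is in the slim itself or has at least one ancestor there; a term whose only ancestors lie outside the slim ends up in the "unslimmed" set.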

@ValWood
Member Author

ValWood commented Mar 28, 2018

Please just send me the terms you added. I don't know what the problem is but when I paste your list into my list it doesn't work.

I only want the terms you added.
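
It is not clear from this thread why the pasted list failed. If stray characters picked up during copy and paste were the cause (an assumption, not something confirmed here), pulling the IDs out with a pattern would sidestep the problem. A minimal sketch with a hypothetical helper and a made-up messy input string:

```python
import re

# A GO identifier is "GO:" followed by exactly seven digits.
GO_ID = re.compile(r"GO:\d{7}")


def extract_go_ids(text):
    """Return the GO IDs found in `text`, in order, without repeats.

    Everything between IDs (spaces, tabs, CR/LF, non-breaking spaces,
    byte-order marks, other control characters) is simply ignored.
    """
    return list(dict.fromkeys(GO_ID.findall(text)))


# Made-up input mixing a BOM, Windows line endings, a non-breaking space,
# a tab, a vertical tab, and one repeated ID.
messy = "\ufeffGO:0022414\r\nGO:0032501\u00a0GO:0032502\tGO:0002376 \x0bGO:0022414\n"

clean = extract_go_ids(messy)
print("\n".join(clean))
```

The output is one ID per line, which can be pasted into a tool or saved as a plain text file.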

@ValWood
Member Author

ValWood commented Mar 28, 2018

I don't know why...

files

@Antonialock
Member

I emailed you

@ValWood
Member Author

ValWood commented Mar 28, 2018

Your list is pretty damn good ! We are on the same page with what to include/exclude. Most of the stuff that isn't mapped is non-specific.

I enriched the "drop out" and I think we should add a few terms, but I'll open a new ticket for outstanding tasks and close this one....
