Project 5: A framework to evaluate profiles from DNA-binding site collections represented in peak sequences from ChIP-Seq assays #6

Open
ttimbers opened this issue Jun 13, 2016 · 5 comments

Comments

@ttimbers
Contributor

Project:
Transcription factor binding sites are important regulatory elements found upstream or downstream of a gene's transcription start site. These DNA-binding sites are non-exact, are often represented by positional probabilities in a matrix, and appear to have slightly different affinities across different ChIP-Seq assays. Here, we propose a framework to evaluate profiles from DNA-binding site collections (JASPAR, HocoMoco, UniPROBE, Jolma et al., TRANSFAC) against what is found in peaks called from ChIP-Seq assays.

The input is a position-weight matrix (PWM) representing the DNA profile for a given binding site of interest. The first part would automatically query the ENCODE project's API for experiments targeting the gene that corresponds to the profile. The sequences at the respective peaks would then be extracted and scanned using the PWM. The goal is to find how well the PWM agrees with what is found in experimental data. The output would be a summary of the profile's representation across sequences, and statistics on the number of possible matches found per sequence. Depending on which experiments are queried, further aims can include:

  • Comparing the profiles from alternative databases and versions to identify the most accurate representation for each experiment.
  • Determining whether a database better represents a given organism's binding site (Mouse or Human).
  • Using the same approach to identify profiles for binding sites not targeted by the experiment but also frequently located on the peaks.
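
The scanning and per-sequence statistics described above could be sketched roughly as follows. This is only a minimal illustration in plain Python, not the project's implementation: the count matrix, the function names (`to_log_odds`, `scan`, `summarize`), the pseudocount, the uniform background and the score threshold are all assumptions made for the example, and only the forward strand is scanned. A real run would more likely rely on a dedicated matcher such as MOODS.

```python
import math

# Hypothetical 4-position count matrix (made up for illustration,
# not taken from JASPAR or any other database).
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 9, 0, 1],
    "G": [1, 0, 10, 0],
    "T": [1, 1, 0, 8],
}


def to_log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into a log2-odds PWM."""
    width = len(counts["A"])
    pwm = {base: [] for base in "ACGT"}
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
        for b in "ACGT":
            freq = (counts[b][i] + pseudocount) / total
            pwm[b].append(math.log2(freq / background))
    return pwm


def scan(pwm, sequence, threshold):
    """Return (start, score) for every forward-strand window scoring >= threshold."""
    width = len(pwm["A"])
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Skip windows containing ambiguous bases such as N.
        if any(base not in pwm for base in window):
            continue
        score = sum(pwm[base][i] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


def summarize(pwm, peak_sequences, threshold):
    """Count matches per peak sequence and the fraction of peaks with a match."""
    per_sequence = [len(scan(pwm, s, threshold)) for s in peak_sequences]
    with_match = sum(1 for n in per_sequence if n > 0)
    fraction = with_match / len(peak_sequences) if peak_sequences else 0.0
    return {
        "sequences": len(peak_sequences),
        "with_match": with_match,
        "fraction_with_match": fraction,
        "hits_per_sequence": per_sequence,
    }


if __name__ == "__main__":
    pwm = to_log_odds(COUNTS)
    # Toy "peak" sequences standing in for sequences extracted from ChIP-Seq peaks.
    peaks = ["TTACGTAA", "GGGGGGGG", "ACGTACGT"]
    print(summarize(pwm, peaks, threshold=4.0))
```

The threshold here is arbitrary; in practice it would probably be derived from a p-value or a fraction of the maximum achievable score, and both strands would be scanned.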

Ideally, this project would involve about 1.5-2.0 days of development and 1-1.5 days of experimentation, attempting to answer questions using the project. Interesting skills for this project would include:

  • Software development, scripting, object-oriented programming, REST APIs.
  • Experience with transcription factor binding sites, motif discovery.
  • Prior research with transcription factors and co-factor interactions.

Project Lead: Manuel Belmadani / @mbelmadani / Industry Professional / University of British Columbia

@ttimbers ttimbers changed the title Develop aframework to evaluate profiles from DNA-binding site collections (JASPAR, HocoMoco, UniPROBE, Jolma et al., TRANSFAC) versus what is found in peaks called from ChIP-Seq assays Develop aframework to evaluate profiles from DNA-binding site collections versus what is found in peaks called from ChIP-Seq assays Jun 13, 2016
@Brittdrog Brittdrog changed the title Develop aframework to evaluate profiles from DNA-binding site collections versus what is found in peaks called from ChIP-Seq assays A framework to evaluate profiles from DNA-binding site collections represented in peak sequences from ChIP-Seq assays Jun 13, 2016
@sjackman
Contributor

We're planning to have a Docker image with a bunch of bioinformatics software preinstalled running on machines at the BC Cancer Agency Genome Sciences Centre during the Hackathon. Which bioinformatics software do you plan to use for your project? In particular, is there any software that you plan to use that is not already listed here? http://www.bcgsc.ca/services/orca

@mbelmadani

Hi Shaun,

Most of the tools I had in mind are either straightforward to install or already in the ORCA image. But just in case, here's a few extras I was thinking of using:

MOODS - https://www.cs.helsinki.fi/group/pssmfind/ - PWM matching algorithms

Also, MOODS requires a C++ compiler and probably the package python-dev (headers needed to build Python C extensions).

I see that MEME is listed in the ORCA software. Is this the entire MEME Suite, or just the MEME motif discovery tool? The MEME Suite also includes a bunch of relevant tools we may use, so I wouldn't mind having the suite installed, if possible. http://meme-suite.org/doc/download.html?man_type=web

On the same page, there's the "Motif Databases" link, which we'd probably need. They're just plain text files, but it would be great if you could also pre-download them on the image. Let me know where they will sit on the filesystem!

That's all I can think of right now. Thanks for looking into this!

Cheers,

@sjackman
Contributor

sjackman commented Jul 1, 2016

Hi, Manuel. I believe MEME is the whole suite.
http://meme-suite.org/meme-software/4.10.1/meme_4.10.1_3.tar.gz
It's installed using Homebrew/Linuxbrew: brew install homebrew/science/meme
See http://brew.sh and http://linuxbrew.sh

I'll create a ticket to install MOODS. hackseq/October_2016#41

We'll download data/databases at the start of the Hackathon, unless they're unusually large and downloading them is expected to delay getting started, in which case we can look into downloading them in advance.

@mbelmadani

mbelmadani commented Jul 11, 2016

That's great, thanks!

It should be fine to download the databases the first day of the Hackathon.

@sjackman sjackman changed the title A framework to evaluate profiles from DNA-binding site collections represented in peak sequences from ChIP-Seq assays Project 5: A framework to evaluate profiles from DNA-binding site collections represented in peak sequences from ChIP-Seq assays Aug 28, 2016