# Ancestral states

The polarized sites of two species have been analyzed:

- [H. sapiens](#human)
- [A. thaliana](#thali)


Before running this notebook, run the notebooks in the following folder: rotational. Some external data also needs to be downloaded. Each section gives further details.

## H. sapiens <a id="human"></a>


Create a folder named **sapiens** and place inside:

- The data downloaded from: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz20way/  
Not all the files are required, only the MAF files.

- The chain file used to perform the hg19 to hg38 liftover, which can be downloaded from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz

Compute the polarized sites. The resulting ``polarized.tsv.gz`` file contains the polarized sites that intersect with nucleosomes of high rotational dyads.

---
Because the alignment data is in hg38 and the nucleosome files are in hg19, the nucleosome files are lifted over to hg38.

In [None]:
%%bash 

source activate env_nucperiod

rotational=${PWD}/../rotational/sapiens
scripts=${PWD}/scripts

cd sapiens

# Compute the polarized sites
python ${scripts}/sapiens.py maf polarized_unsorted.bed.gz

# Sort by chromosome and start position (needed by intersectBed -sorted)
zcat polarized_unsorted.bed.gz | sort -k1,1 -k2,2n | gzip > polarized.bed.gz

source deactivate
source activate env_crossmap

# Lift over the high rotational dyads from hg19 to hg38 and sort them
CrossMap.py bed hg19ToHg38.over.chain.gz ${rotational}/high_rotational_dyads.bed.gz rot_high.bed
gzip -f rot_high.bed
zcat rot_high.bed.gz | sort -k1,1 -k2,2n | gzip > dyads_hg38_rot_high.bed.gz

source deactivate
source activate env_doe

# Intersect the polarized sites with the nucleosomes (dyad +/- 73 bp)
# and report the position of each site relative to the dyad
zcat dyads_hg38_rot_high.bed.gz | awk '{ OFS="\t"; }{print $1, $2-73, $3+73, $1 "_" $2 "_" $3}' | \
    intersectBed -a polarized.bed.gz -b stdin -wo -sorted | \
    awk '{ OFS="\t";}{ print $9, $2-$7-73, $4, $5}' | \
    gzip > polarized.tsv.gz
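
The coordinate arithmetic in the two `awk` commands above can be sketched in Python. This is only an illustration: the function names are hypothetical, and it assumes that each dyad is a 1 bp BED interval and that the fields used by `awk` (`$2` for the site start, `$7` for the window start) are laid out as the commands suggest.

```python
HALF_WINDOW = 73  # bases added on each side of the dyad, as in the awk commands


def nucleosome_window(chrom, dyad_start, dyad_end, half_window=HALF_WINDOW):
    """Return the BED interval and identifier built around a dyad."""
    name = f"{chrom}_{dyad_start}_{dyad_end}"
    return chrom, dyad_start - half_window, dyad_end + half_window, name


def relative_position(site_start, window_start, half_window=HALF_WINDOW):
    """Offset of a site from the dyad: 0 at the dyad, negative at lower coordinates."""
    return site_start - window_start - half_window


chrom, win_start, win_end, name = nucleosome_window("chr1", 1000, 1001)
print(name, win_start, win_end)             # chr1_1000_1001 927 1074
print(relative_position(1000, win_start))   # 0
print(relative_position(990, win_start))    # -10
```

Under these assumptions the relative position ranges from -73 to 73 for sites inside the window.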

## A. thaliana <a id="thali"></a>


Create a folder named **thaliana** and place inside:

- The genome of Brassica rapa aligned to the A. thaliana TAIR10 genome, downloaded from ftp://ftp.ensemblgenomes.org/pub/plants/release-40/maf/arabidopsis_thaliana_TAIR10_vs_brassica_rapa_IVFCAASv1_lastz_net.tar.gz

- The genome of Arabidopsis lyrata aligned to the A. thaliana TAIR10 genome, downloaded from ftp://ftp.ensemblgenomes.org/pub/plants/release-40/maf/arabidopsis_thaliana_TAIR10_vs_arabidopsis_lyrata_v_1_0_lastz_net.tar.gz


Extract the contents of these tar archives before running the notebook.
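
The extraction can be done with any tar tool. As a minimal sketch, the following Python helper (the name `extract_archives` is hypothetical and not part of this repository) extracts the two archives listed above when they are present in a folder, assuming they were saved under their original file names.

```python
import tarfile
from pathlib import Path

# Archive names taken from the download URLs listed above
ARCHIVES = [
    "arabidopsis_thaliana_TAIR10_vs_brassica_rapa_IVFCAASv1_lastz_net.tar.gz",
    "arabidopsis_thaliana_TAIR10_vs_arabidopsis_lyrata_v_1_0_lastz_net.tar.gz",
]


def extract_archives(folder, archives=ARCHIVES):
    """Extract each archive found in `folder`; return the names that were extracted."""
    folder = Path(folder)
    extracted = []
    for name in archives:
        path = folder / name
        if not path.exists():
            continue  # skip archives that have not been downloaded yet
        with tarfile.open(path, "r:gz") as tar:
            tar.extractall(folder)
        extracted.append(name)
    return extracted
```

Calling `extract_archives("thaliana")` from the notebook folder would unpack both archives in place.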

Compute the polarized sites (``polarized.tsv.gz``) that intersect with nucleosomes with a rotational score equal to 1.

In [None]:
%%bash

source activate env_nucperiod

rotational=${PWD}/../rotational/thaliana
scripts=${PWD}/scripts

cd thaliana

# Parse the input files to get the sequences
python ${scripts}/thaliana.py parse arabidopsis_thaliana_TAIR10_vs_arabidopsis_lyrata_v_1_0_lastz_net arabidopsis_lyrata lyrata_unsorted.bed.gz
zcat lyrata_unsorted.bed.gz | sort -k1,1 -k2,2n | gzip > lyrata.bed.gz
python ${scripts}/thaliana.py parse arabidopsis_thaliana_TAIR10_vs_brassica_rapa_IVFCAASv1_lastz_net brassica_rapa rapa_unsorted.bed.gz
zcat rapa_unsorted.bed.gz | sort -k1,1 -k2,2n | gzip > rapa.bed.gz

# Find the regions covered by both alignments
intersectBed -a rapa.bed.gz -b lyrata.bed.gz -wo -sorted | gzip > thaliana_rapa_lyrata.intersect.gz

# Keep only the positions where columns 4 and 9 are equal
zcat thaliana_rapa_lyrata.intersect.gz | awk '{OFS="\t"}{if($4==$9){print $1, $2, $3, $4, $9, $5}}' | \
    gzip > thaliana_rapa_lyrata.intersect.equal_ancestor.bed.gz

# Find polarized sites
python ${scripts}/thaliana.py find thaliana_rapa_lyrata.intersect.equal_ancestor.bed.gz polarized.bed.gz

# Intersect the polarized sites with the nucleosomes (dyad +/- 73 bp)
# and report the position of each site relative to the dyad
zcat ${rotational}/score1_rotational_dyads.gz | \
    awk '{ OFS="\t"; }{print $1, $2-73, $3+73, $1 "_" $2 "_" $3}' | \
    intersectBed -a polarized.bed.gz -b stdin -wo -sorted | \
    awk '{ OFS="\t";}{ print $9, $2-$7-73, $4, $5}' | \
    gzip > polarized.tsv.gz
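
The idea behind polarizing a site with two outgroups can be sketched as follows. The logic of `thaliana.py` is not shown in this notebook, so the rule below is an assumption: a site is treated as polarized when both outgroups carry the same base (taken as the ancestral state) and the reference base differs from it. The function name `polarize` is hypothetical.

```python
VALID_BASES = {"A", "C", "G", "T"}


def polarize(reference, outgroup_a, outgroup_b):
    """Classify one aligned site.

    Returns (ancestral, derived) when both outgroups agree and the
    reference differs from them; returns None otherwise.
    """
    ref, a, b = reference.upper(), outgroup_a.upper(), outgroup_b.upper()
    if not {ref, a, b} <= VALID_BASES:
        return None  # gaps or ambiguous bases cannot be polarized
    if a != b:
        return None  # the outgroups disagree, so the ancestral state is unclear
    if ref == a:
        return None  # no change along the reference lineage
    return a, ref


print(polarize("T", "C", "C"))  # ('C', 'T')
print(polarize("C", "C", "C"))  # None
print(polarize("T", "C", "G"))  # None
```

Requiring both outgroups to agree is a conservative choice: it discards sites where either outgroup lineage may have mutated.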