# Characterizing CpG Methylation

I'll describe general methylation trends in the geoduck (*Panopea generosa*) genome using 10x coverage data concatenated across all samples. I'll also identify methylation islands in the geoduck genome with a Perl script from Jeong et al. (2018).

1. Obtain concatenated coverage file
2. Characterize methylation level for each CpG dinucleotide
3. Determine genomic locations for CpGs
4. Generate methylation islands

## 1. Obtain concatenated coverage file

In [20]:
#Download from gannet
!wget https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/0102/Pg_val_1_bismark_bt2_pe._10x.bedgraph

--2020-03-09 10:31:27--  https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/0102/Pg_val_1_bismark_bt2_pe._10x.bedgraph
Resolving gannet.fish.washington.edu... 128.95.149.52
Connecting to gannet.fish.washington.edu|128.95.149.52|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39782030 (38M)
Saving to: ‘Pg_val_1_bismark_bt2_pe._10x.bedgraph’


2020-03-09 10:31:30 (13.2 MB/s) - ‘Pg_val_1_bismark_bt2_pe._10x.bedgraph’ saved [39782030/39782030]

--2020-03-09 10:31:30--  http://../Data/
Resolving ..... failed: nodename nor servname provided, or not known.
wget: unable to resolve host address ‘..’
FINISHED --2020-03-09 10:31:30--
Total wall clock time: 3.0s
Downloaded: 1 files, 38M in 2.9s (13.2 MB/s)


In [21]:
#Move to Data folder
!mv *bedgraph ../Data/

In [23]:
#Confirm file was moved
!ls ../Data/*bedgraph

../Data/Pg_val_1_bismark_bt2_pe._10x.bedgraph


In [147]:
#Columns: chr, start, end, %meth
!head ../Data/Pg_val_1_bismark_bt2_pe._10x.bedgraph

Scaffold_01	53	55	3.125000
Scaffold_01	71	73	2.777778
Scaffold_01	95	97	2.040816
Scaffold_01	118	120	0.000000
Scaffold_01	192	194	0.000000
Scaffold_01	201	203	0.000000
Scaffold_01	208	210	0.000000
Scaffold_01	212	214	0.000000
Scaffold_01	220	222	0.000000
Scaffold_01	237	239	1.315789


In [148]:
!wc -l ../Data/Pg_val_1_bismark_bt2_pe._10x.bedgraph

 1016980 ../Data/Pg_val_1_bismark_bt2_pe._10x.bedgraph


## 2. Characterize methylation level for each CpG dinucleotide

- methylated: ≥ 50% methylated
- sparsely methylated: > 10% and < 50% methylated
- unmethylated: ≤ 10% methylated
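
The `awk` filters in 2a to 2c place the boundaries at `>= 50`, strictly between 10 and 50, and `<= 10`. As a minimal sketch of the same rule in Python (the function name `classify_cpg` is hypothetical and not part of the pipeline):

```python
def classify_cpg(percent_meth):
    """Return the methylation category for one CpG's percent methylation.

    Boundaries mirror the awk filters: >= 50, (10, 50), and <= 10.
    """
    if percent_meth >= 50:
        return "methylated"
    if percent_meth > 10:
        return "sparsely methylated"
    return "unmethylated"


# Values taken from the head outputs in this notebook, plus the two boundaries
for value in (3.125, 10.0, 15.384615, 50.0, 77.083333):
    print(value, classify_cpg(value))
```

Because every value falls into exactly one branch, the three output files should partition the input file with no locus counted twice.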

### 2a. Methylated CpGs

In [133]:
#If percent methylation is greater than or equal to 50, then save the loci information
!awk '{if ($4 >= 50) { print $1, $2, $3, $4 }}' ../Data/Pg_val_1_bismark_bt2_pe._10x.bedgraph \
> ../Output/Pg_val_1_bismark_bt2_pe._10x-Methylated.bedgraph

In [135]:
methylatedCpGs = "../Output/Pg_val_1_bismark_bt2_pe._10x-Methylated.bedgraph"

In [146]:
!head ../Output/Pg_val_1_bismark_bt2_pe._10x-Methylated.bedgraph
!wc -l ../Output/Pg_val_1_bismark_bt2_pe._10x-Methylated.bedgraph

Scaffold_01 11797 11799 50.000000
Scaffold_01 11838 11840 50.000000
Scaffold_01 11843 11845 50.000000
Scaffold_01 11846 11848 50.000000
Scaffold_01 11851 11853 55.555556
Scaffold_01 12029 12031 56.250000
Scaffold_01 51414 51416 56.834532
Scaffold_01 51426 51428 58.333333
Scaffold_01 51470 51472 55.072464
Scaffold_01 51563 51565 77.083333
  310808 ../Output/Pg_val_1_bismark_bt2_pe._10x-Methylated.bedgraph


### 2b. Sparsely methylated CpGs

In [30]:
%%bash
#If percent methylation is greater than 10 and less than 50, then save the loci information
awk '{if ($4 > 10 && $4 < 50) { print $1, $2, $3, $4 }}' ../Data/Pg_val_1_bismark_bt2_pe._10x.bedgraph \
> ../Output/Pg_val_1_bismark_bt2_pe._10x-Sparsely-Methylated.bedgraph

In [145]:
!head ../Output/Pg_val_1_bismark_bt2_pe._10x-Sparsely-Methylated.bedgraph
!wc -l ../Output/Pg_val_1_bismark_bt2_pe._10x-Sparsely-Methylated.bedgraph

Scaffold_01 617 619 15.384615
Scaffold_01 2962 2964 12.658228
Scaffold_01 7518 7520 12.000000
Scaffold_01 8343 8345 17.073171
Scaffold_01 8347 8349 15.384615
Scaffold_01 8352 8354 22.448980
Scaffold_01 8358 8360 31.506849
Scaffold_01 8366 8368 34.426230
Scaffold_01 8381 8383 26.363636
Scaffold_01 8385 8387 23.931624
   92368 ../Output/Pg_val_1_bismark_bt2_pe._10x-Sparsely-Methylated.bedgraph


### 2c. Unmethylated CpGs

In [34]:
#If percent methylation is less than or equal to 10, then save the loci information
!awk '{if ($4 <= 10) { print $1, $2, $3, $4 }}' ../Data/Pg_val_1_bismark_bt2_pe._10x.bedgraph \
> ../Output/Pg_val_1_bismark_bt2_pe._10x-Unmethylated.bedgraph

In [144]:
!head ../Output/Pg_val_1_bismark_bt2_pe._10x-Unmethylated.bedgraph
!wc -l ../Output/Pg_val_1_bismark_bt2_pe._10x-Unmethylated.bedgraph

Scaffold_01 53 55 3.125000
Scaffold_01 71 73 2.777778
Scaffold_01 95 97 2.040816
Scaffold_01 118 120 0.000000
Scaffold_01 192 194 0.000000
Scaffold_01 201 203 0.000000
Scaffold_01 208 210 0.000000
Scaffold_01 212 214 0.000000
Scaffold_01 220 222 0.000000
Scaffold_01 237 239 1.315789
  613804 ../Output/Pg_val_1_bismark_bt2_pe._10x-Unmethylated.bedgraph
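
As a quick sanity check, the three `wc -l` counts above should add up to the 1016980 loci in the input file. The sketch below hard-codes the counts reported in this notebook and converts them to percentages; the variable names are illustrative only.

```python
# Line counts copied from the wc -l outputs in sections 1 and 2
counts = {
    "methylated": 310808,
    "sparsely methylated": 92368,
    "unmethylated": 613804,
}
total_loci = 1016980  # wc -l of the 10x bedgraph

# The three categories should partition the input exactly
assert sum(counts.values()) == total_loci

percentages = {name: 100 * n / total_loci for name, n in counts.items()}
for name, pct in percentages.items():
    print(f"{name}: {counts[name]} CpGs ({pct:.2f}%)")
```

The counts sum to the total, so no locus was dropped or double-counted at the 10% and 50% boundaries.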


## 3. Determine genomic locations for CpGs

### 3a. Create BED files for `bedtools` and IGV

In [149]:
!find ../Output/*bedgraph

../Output/Pg_val_1_bismark_bt2_pe._10x-Methylated.bedgraph
../Output/Pg_val_1_bismark_bt2_pe._10x-Sparsely-Methylated.bedgraph
../Output/Pg_val_1_bismark_bt2_pe._10x-Unmethylated.bedgraph


In [150]:
%%bash

#Keep only chr, start, and end, and write them tab-delimited
for f in ../Output/*bedgraph
do
    awk '{print $1"\t"$2"\t"$3}' "${f}" > "${f}.bed"
done

In [152]:
!awk '{print $1"\t"$2"\t"$3}' ../Data/Pg_val_1_bismark_bt2_pe._10x.bedgraph \
> ../Data/Pg_val_1_bismark_bt2_pe._10x.bedgraph.bed
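
The filtered bedgraph files are space-delimited because `awk` joins comma-separated `print` arguments with a single space by default, while the BED files written here are tab-delimited. A rough Python equivalent of the conversion, assuming whitespace-separated input (the helper `bedgraph_to_bed` is hypothetical):

```python
def bedgraph_to_bed(lines):
    """Yield tab-delimited chrom/start/end records from bedgraph lines."""
    for line in lines:
        fields = line.split()  # splits on spaces or tabs
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        yield "\t".join(fields[:3])


# One space-delimited and one tab-delimited line from the outputs above
example = [
    "Scaffold_01 11797 11799 50.000000",
    "Scaffold_01\t53\t55\t3.125000",
]
bed_records = list(bedgraph_to_bed(example))
print("\n".join(bed_records))
```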

### 3b. Set variable paths

In [124]:
exons = "../Data/Genome/Panopea-generosa-v1.0.a4.exon.gff3"

In [125]:
!head {exons}

##gff-version 3
##Generated using GenSAS, Tuesday 26th of November 2019 07:12:25 PM
##Project Name : Pgenerosa_v074
Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	2	125	.	+	.	ID=PGEN_.00g000010.m01.exon01;Name=PGEN_.00g000010.m01.exon01;Parent=PGEN_.00g000010.m01;original_ID=21510-PGEN_.00g234140.m01.exon1;Alias=21510-PGEN_.00g234140.m01.exon1
Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	1995	2095	.	+	.	ID=PGEN_.00g000010.m01.exon02;Name=PGEN_.00g000010.m01.exon02;Parent=PGEN_.00g000010.m01;original_ID=21510-PGEN_.00g234140.m01.exon2;Alias=21510-PGEN_.00g234140.m01.exon2
Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	3325	3495	.	+	.	ID=PGEN_.00g000010.m01.exon03;Name=PGEN_.00g000010.m01.exon03;Parent=PGEN_.00g000010.m01;original_ID=21510-PGEN_.00g234140.m01.exon3;Alias=21510-PGEN_.00g234140.m01.exon3
Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	4651	4719	.	+	.	ID=PGEN_.00g000010.m01.exon04;Name=PGEN_.00g000010.m01.exon04;Parent=PGEN_.00g000010.m01;original_ID=21510-PGEN_.00g23414

In [126]:
!wc -l {exons}

  236963 ../Data/Genome/Panopea-generosa-v1.0.a4.exon.gff3


In [120]:
genes = "../Data/Genome/Panopea-generosa-v1.0.a4.gene.gff3"

In [122]:
!head {genes}

##gff-version 3
##Generated using GenSAS, Tuesday 26th of November 2019 07:12:25 PM
##Project Name : Pgenerosa_v074
Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	2	4719	.	+	.	ID=PGEN_.00g000010;Name=PGEN_.00g000010;original_ID=21510-PGEN_.00g234140;Alias=21510-PGEN_.00g234140;original_name=21510-PGEN_.00g234140;Notes=sp|Q86IC9|CAMT1_DICDI [BLAST protein vs protein (blastp) 2.7.1],PF01596.12 [Pfam 1.6]
Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	19808	36739	.	-	.	ID=PGEN_.00g000020;Name=PGEN_.00g000020;original_ID=21510-PGEN_.00g234150;Alias=21510-PGEN_.00g234150;original_name=21510-PGEN_.00g234150;Notes=sp|P04177|TY3H_RAT [BLAST protein vs protein (blastp) 2.7.1],sp|P04177|TY3H_RAT [DIAMOND Functional 0.9.22],IPR036951 [InterProScan 5.29-68.0],PF00351.16 [Pfam 1.6]
Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	49248	52578	.	-	.	ID=PGEN_.00g000030;Name=PGEN_.00g000030;original_ID=21510-PGEN_.00g234160;Alias=21510-PGEN_.00g234160;original_name=21510-PGEN_.00g234160;Notes=PF08054.6 [Pfa

In [123]:
!wc -l {genes}

   34950 ../Data/Genome/Panopea-generosa-v1.0.a4.gene.gff3


In [127]:
bedtoolsDirectory = "/usr/local/bin/"

In [128]:
!{bedtoolsDirectory}intersectBed -h


Tool:    bedtools intersect (aka intersectBed)
Version: v2.29.2
Summary: Report overlaps between two feature files.

Usage:   bedtools intersect [OPTIONS] -a <bed/gff/vcf/bam> -b <bed/gff/vcf/bam>

	Note: -b may be followed with multiple databases and/or 
	wildcard (*) character(s). 
Options: 
	-wa	Write the original entry in A for each overlap.

	-wb	Write the original entry in B for each overlap.
		- Useful for knowing _what_ A overlaps. Restricted by -f and -r.

	-loj	Perform a "left outer join". That is, for each feature in A
		report each overlap with B.  If no overlaps are found, 
		report a NULL feature for B.

	-wo	Write the original A and B entries plus the number of base
		pairs of overlap between the two features.
		- Overlaps restricted by -f and -r.
		  Only A features with overlap are reported.

	-wao	Write the original A and B entries plus the number of base
		pairs of overlap between the two features.
		- Overlapping features restricted by -f 

In [153]:
all10xCpGs = "../Data/Pg_val_1_bismark_bt2_pe._10x.bedgraph.bed"

In [154]:
methCpGs = "../Output/Pg_val_1_bismark_bt2_pe._10x-Methylated.bedgraph.bed"

In [157]:
sparseMethCpGs = "../Output/Pg_val_1_bismark_bt2_pe._10x-Sparsely-Methylated.bedgraph.bed"

In [159]:
unMethCpGs = "../Output/Pg_val_1_bismark_bt2_pe._10x-Unmethylated.bedgraph.bed"

### 3c. Exons

#### All 10x CpGs

In [160]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {all10xCpGs} \
-b {exons} \
| wc -l
!echo "all 10x CpG loci overlaps with exons"

  112000
all 10x CpG loci overlaps with exons


In [162]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {all10xCpGs} \
-b {exons} \
> ../Output/Pg-10xCpGs-Exon-Overlaps.txt

In [163]:
!head ../Output/Pg-10xCpGs-Exon-Overlaps.txt

Scaffold_01	53	55	Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	2	125	.	+	.	ID=PGEN_.00g000010.m01.exon01;Name=PGEN_.00g000010.m01.exon01;Parent=PGEN_.00g000010.m01;original_ID=21510-PGEN_.00g234140.m01.exon1;Alias=21510-PGEN_.00g234140.m01.exon1	2
Scaffold_01	71	73	Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	2	125	.	+	.	ID=PGEN_.00g000010.m01.exon01;Name=PGEN_.00g000010.m01.exon01;Parent=PGEN_.00g000010.m01;original_ID=21510-PGEN_.00g234140.m01.exon1;Alias=21510-PGEN_.00g234140.m01.exon1	2
Scaffold_01	95	97	Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	2	125	.	+	.	ID=PGEN_.00g000010.m01.exon01;Name=PGEN_.00g000010.m01.exon01;Parent=PGEN_.00g000010.m01;original_ID=21510-PGEN_.00g234140.m01.exon1;Alias=21510-PGEN_.00g234140.m01.exon1	2
Scaffold_01	118	120	Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	2	125	.	+	.	ID=PGEN_.00g000010.m01.exon01;Name=PGEN_.00g000010.m01.exon01;Parent=PGEN_.00g000010.m01;original_ID=21510-PGEN_.00g234140.m01.exon1;Alias=21510-PGEN_.00g234140.m01.exon1	2

#### Methylated CpGs

In [165]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {methCpGs} \
-b {exons} \
| wc -l
!echo "methylated CpG loci overlaps with exons"

   50915
methylated CpG loci overlaps with exons


In [166]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {methCpGs} \
-b {exons} \
> ../Output/Pg-MethylatedCpGs-Exon-Overlaps.txt

In [167]:
!head ../Output/Pg-MethylatedCpGs-Exon-Overlaps.txt

Scaffold_01	51414	51416	Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	51394	51541	.	-	.	ID=PGEN_.00g000030.m01.exon02;Name=PGEN_.00g000030.m01.exon02;Parent=PGEN_.00g000030.m01;original_ID=21510-PGEN_.00g234160.m02.exon2;Alias=21510-PGEN_.00g234160.m02.exon2	2
Scaffold_01	51414	51416	Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	51394	51541	.	-	.	ID=PGEN_.00g000030.m02.exon02;Name=PGEN_.00g000030.m02.exon02;Parent=PGEN_.00g000030.m02;original_ID=21510-PGEN_.00g234160.m01.exon2;Alias=21510-PGEN_.00g234160.m01.exon2	2
Scaffold_01	51426	51428	Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	51394	51541	.	-	.	ID=PGEN_.00g000030.m01.exon02;Name=PGEN_.00g000030.m01.exon02;Parent=PGEN_.00g000030.m01;original_ID=21510-PGEN_.00g234160.m02.exon2;Alias=21510-PGEN_.00g234160.m02.exon2	2
Scaffold_01	51426	51428	Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	51394	51541	.	-	.	ID=PGEN_.00g000030.m02.exon02;Name=PGEN_.00g000030.m02.exon02;Parent=PGEN_.00g000030.m02;original_ID=21510-PGEN_.00g234160.m01
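
In the output above, each CpG appears twice: once for the `m01` exon and once for the `m02` exon of the same gene. With `-wo`, `intersectBed` writes one line per overlapping pair, so the `wc -l` values count CpG-exon pairs rather than distinct CpGs. The sketch below counts both from `-wo` style lines; the example rows abbreviate the GFF source and attribute columns, and `count_unique_loci` is a hypothetical helper.

```python
def count_unique_loci(overlap_lines):
    """Return (overlap rows, distinct CpG loci) for tab-delimited -wo lines."""
    rows = 0
    loci = set()
    for line in overlap_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        rows += 1
        loci.add((fields[0], fields[1], fields[2]))  # chrom, start, end of the CpG
    return rows, len(loci)


# Abbreviated versions of the four rows shown above
example = [
    "Scaffold_01\t51414\t51416\tScaffold_01\tGenSAS\texon\t51394\t51541\t.\t-\t.\tID=m01.exon02\t2",
    "Scaffold_01\t51414\t51416\tScaffold_01\tGenSAS\texon\t51394\t51541\t.\t-\t.\tID=m02.exon02\t2",
    "Scaffold_01\t51426\t51428\tScaffold_01\tGenSAS\texon\t51394\t51541\t.\t-\t.\tID=m01.exon02\t2",
    "Scaffold_01\t51426\t51428\tScaffold_01\tGenSAS\texon\t51394\t51541\t.\t-\t.\tID=m02.exon02\t2",
]
rows, unique_loci = count_unique_loci(example)
print(rows, "overlap rows,", unique_loci, "distinct CpGs")
```

If distinct CpG counts are what is needed, the `-u` option of `bedtools intersect` reports each A entry once when any overlap exists.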

#### Sparsely methylated CpGs

In [169]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {sparseMethCpGs} \
-b {exons} \
| wc -l
!echo "sparsely methylated CpG loci overlaps with exons"

    4968
sparsely methylated CpG loci overlaps with exons


In [170]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {sparseMethCpGs} \
-b {exons} \
> ../Output/Pg-SparselyMethylatedCpGs-Exon-Overlaps.txt

In [171]:
!head ../Output/Pg-SparselyMethylatedCpGs-Exon-Overlaps.txt

Scaffold_01	19818	19820	Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	19808	19943	.	-	.	ID=PGEN_.00g000020.m01.exon01;Name=PGEN_.00g000020.m01.exon01;Parent=PGEN_.00g000020.m01;original_ID=21510-PGEN_.00g234150.m01.exon10;Alias=21510-PGEN_.00g234150.m01.exon10	2
Scaffold_01	19941	19943	Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	19808	19943	.	-	.	ID=PGEN_.00g000020.m01.exon01;Name=PGEN_.00g000020.m01.exon01;Parent=PGEN_.00g000020.m01;original_ID=21510-PGEN_.00g234150.m01.exon10;Alias=21510-PGEN_.00g234150.m01.exon10	2
Scaffold_01	24937	24939	Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	24824	24959	.	-	.	ID=PGEN_.00g000020.m01.exon04;Name=PGEN_.00g000020.m01.exon04;Parent=PGEN_.00g000020.m01;original_ID=21510-PGEN_.00g234150.m01.exon7;Alias=21510-PGEN_.00g234150.m01.exon7	2
Scaffold_01	49258	49260	Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	49248	49540	.	-	.	ID=PGEN_.00g000030.m01.exon01;Name=PGEN_.00g000030.m01.exon01;Parent=PGEN_.00g000030.m01;original_ID=21510-PGEN_.00g234160

#### Unmethylated CpGs

In [172]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {unMethCpGs} \
-b {exons} \
| wc -l
!echo "unmethylated CpG loci overlaps with exons"

   56117
unmethylated CpG loci overlaps with exons


In [173]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {unMethCpGs} \
-b {exons} \
> ../Output/Pg-UnmethylatedCpGs-Exon-Overlaps.txt

In [174]:
!head ../Output/Pg-UnmethylatedCpGs-Exon-Overlaps.txt

Scaffold_01	53	55	Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	2	125	.	+	.	ID=PGEN_.00g000010.m01.exon01;Name=PGEN_.00g000010.m01.exon01;Parent=PGEN_.00g000010.m01;original_ID=21510-PGEN_.00g234140.m01.exon1;Alias=21510-PGEN_.00g234140.m01.exon1	2
Scaffold_01	71	73	Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	2	125	.	+	.	ID=PGEN_.00g000010.m01.exon01;Name=PGEN_.00g000010.m01.exon01;Parent=PGEN_.00g000010.m01;original_ID=21510-PGEN_.00g234140.m01.exon1;Alias=21510-PGEN_.00g234140.m01.exon1	2
Scaffold_01	95	97	Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	2	125	.	+	.	ID=PGEN_.00g000010.m01.exon01;Name=PGEN_.00g000010.m01.exon01;Parent=PGEN_.00g000010.m01;original_ID=21510-PGEN_.00g234140.m01.exon1;Alias=21510-PGEN_.00g234140.m01.exon1	2
Scaffold_01	118	120	Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	2	125	.	+	.	ID=PGEN_.00g000010.m01.exon01;Name=PGEN_.00g000010.m01.exon01;Parent=PGEN_.00g000010.m01;original_ID=21510-PGEN_.00g234140.m01.exon1;Alias=21510-PGEN_.00g234140.m01.exon1	2

### 3d. Genes

#### All 10x CpGs

In [175]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {all10xCpGs} \
-b {genes} \
| wc -l
!echo "all 10x CpG loci overlaps with genes"

  473521
all 10x CpG loci overlaps with genes


In [178]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {all10xCpGs} \
-b {genes} \
> ../Output/Pg-10xCpGs-Gene-Overlaps.txt

In [179]:
!head ../Output/Pg-10xCpGs-Gene-Overlaps.txt

Scaffold_01	53	55	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	2	4719	.	+	.	ID=PGEN_.00g000010;Name=PGEN_.00g000010;original_ID=21510-PGEN_.00g234140;Alias=21510-PGEN_.00g234140;original_name=21510-PGEN_.00g234140;Notes=sp|Q86IC9|CAMT1_DICDI [BLAST protein vs protein (blastp) 2.7.1],PF01596.12 [Pfam 1.6]	2
Scaffold_01	71	73	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	2	4719	.	+	.	ID=PGEN_.00g000010;Name=PGEN_.00g000010;original_ID=21510-PGEN_.00g234140;Alias=21510-PGEN_.00g234140;original_name=21510-PGEN_.00g234140;Notes=sp|Q86IC9|CAMT1_DICDI [BLAST protein vs protein (blastp) 2.7.1],PF01596.12 [Pfam 1.6]	2
Scaffold_01	95	97	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	2	4719	.	+	.	ID=PGEN_.00g000010;Name=PGEN_.00g000010;original_ID=21510-PGEN_.00g234140;Alias=21510-PGEN_.00g234140;original_name=21510-PGEN_.00g234140;Notes=sp|Q86IC9|CAMT1_DICDI [BLAST protein vs protein (blastp) 2.7.1],PF01596.12 [Pfam 1.6]	2
Scaffold_01	118	120	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	2	47

#### Methylated CpGs

In [181]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {methCpGs} \
-b {genes} \
| wc -l
!echo "methylated CpG loci overlaps with genes"

  207303
methylated CpG loci overlaps with genes


In [182]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {methCpGs} \
-b {genes} \
> ../Output/Pg-MethylatedCpGs-Gene-Overlaps.txt

In [183]:
!head ../Output/Pg-MethylatedCpGs-Gene-Overlaps.txt

Scaffold_01	51414	51416	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	49248	52578	.	-	.	ID=PGEN_.00g000030;Name=PGEN_.00g000030;original_ID=21510-PGEN_.00g234160;Alias=21510-PGEN_.00g234160;original_name=21510-PGEN_.00g234160;Notes=PF08054.6 [Pfam 1.6]	2
Scaffold_01	51426	51428	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	49248	52578	.	-	.	ID=PGEN_.00g000030;Name=PGEN_.00g000030;original_ID=21510-PGEN_.00g234160;Alias=21510-PGEN_.00g234160;original_name=21510-PGEN_.00g234160;Notes=PF08054.6 [Pfam 1.6]	2
Scaffold_01	51470	51472	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	49248	52578	.	-	.	ID=PGEN_.00g000030;Name=PGEN_.00g000030;original_ID=21510-PGEN_.00g234160;Alias=21510-PGEN_.00g234160;original_name=21510-PGEN_.00g234160;Notes=PF08054.6 [Pfam 1.6]	2
Scaffold_01	51563	51565	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	49248	52578	.	-	.	ID=PGEN_.00g000030;Name=PGEN_.00g000030;original_ID=21510-PGEN_.00g234160;Alias=21510-PGEN_.00g234160;original_name=21510-PGEN_.00g234160;Notes=

#### Sparsely methylated CpGs

In [169]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {sparseMethCpGs} \
-b {genes} \
| wc -l
!echo "sparsely methylated CpG loci overlaps with genes"

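    4968
sparsely methylated CpG loci overlaps with exons

Note: this output is stale. The count and label come from the exon version of this cell (In [169]); the cell was not rerun after `-b` was switched to `{genes}`, so the gene overlap count is not recorded here.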


In [184]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {sparseMethCpGs} \
-b {genes} \
> ../Output/Pg-SparselyMethylatedCpGs-Gene-Overlaps.txt

In [185]:
!head ../Output/Pg-SparselyMethylatedCpGs-Gene-Overlaps.txt

Scaffold_01	617	619	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	2	4719	.	+	.	ID=PGEN_.00g000010;Name=PGEN_.00g000010;original_ID=21510-PGEN_.00g234140;Alias=21510-PGEN_.00g234140;original_name=21510-PGEN_.00g234140;Notes=sp|Q86IC9|CAMT1_DICDI [BLAST protein vs protein (blastp) 2.7.1],PF01596.12 [Pfam 1.6]	2
Scaffold_01	2962	2964	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	2	4719	.	+	.	ID=PGEN_.00g000010;Name=PGEN_.00g000010;original_ID=21510-PGEN_.00g234140;Alias=21510-PGEN_.00g234140;original_name=21510-PGEN_.00g234140;Notes=sp|Q86IC9|CAMT1_DICDI [BLAST protein vs protein (blastp) 2.7.1],PF01596.12 [Pfam 1.6]	2
Scaffold_01	19818	19820	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	19808	36739	.	-	.	ID=PGEN_.00g000020;Name=PGEN_.00g000020;original_ID=21510-PGEN_.00g234150;Alias=21510-PGEN_.00g234150;original_name=21510-PGEN_.00g234150;Notes=sp|P04177|TY3H_RAT [BLAST protein vs protein (blastp) 2.7.1],sp|P04177|TY3H_RAT [DIAMOND Functional 0.9.22],IPR036951 [InterProScan 5.29-68.0]

#### Unmethylated CpGs

In [186]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {unMethCpGs} \
-b {genes} \
| wc -l
!echo "unmethylated CpG loci overlaps with genes"

  225844
unmethylated CpG loci overlaps with genes


In [187]:
! {bedtoolsDirectory}intersectBed \
-wo \
-a {unMethCpGs} \
-b {genes} \
> ../Output/Pg-UnmethylatedCpGs-Gene-Overlaps.txt

In [188]:
!head ../Output/Pg-UnmethylatedCpGs-Gene-Overlaps.txt

Scaffold_01	53	55	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	2	4719	.	+	.	ID=PGEN_.00g000010;Name=PGEN_.00g000010;original_ID=21510-PGEN_.00g234140;Alias=21510-PGEN_.00g234140;original_name=21510-PGEN_.00g234140;Notes=sp|Q86IC9|CAMT1_DICDI [BLAST protein vs protein (blastp) 2.7.1],PF01596.12 [Pfam 1.6]	2
Scaffold_01	71	73	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	2	4719	.	+	.	ID=PGEN_.00g000010;Name=PGEN_.00g000010;original_ID=21510-PGEN_.00g234140;Alias=21510-PGEN_.00g234140;original_name=21510-PGEN_.00g234140;Notes=sp|Q86IC9|CAMT1_DICDI [BLAST protein vs protein (blastp) 2.7.1],PF01596.12 [Pfam 1.6]	2
Scaffold_01	95	97	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	2	4719	.	+	.	ID=PGEN_.00g000010;Name=PGEN_.00g000010;original_ID=21510-PGEN_.00g234140;Alias=21510-PGEN_.00g234140;original_name=21510-PGEN_.00g234140;Notes=sp|Q86IC9|CAMT1_DICDI [BLAST protein vs protein (blastp) 2.7.1],PF01596.12 [Pfam 1.6]	2
Scaffold_01	118	120	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	2	47
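
Because the methylated, sparsely methylated, and unmethylated files partition the 10x loci, their overlap counts should sum to the count for all 10x CpGs. That holds for exons, and the same arithmetic gives the count expected for sparsely methylated CpGs in genes, whose cell output above still shows the exon result. The sketch below uses only counts reported in this notebook; the derived gene value is an inference and should be confirmed by rerunning the cell.

```python
# Overlap counts copied from the wc -l outputs in sections 3c and 3d.
# The sparse gene count is None because its cell output was not refreshed.
overlaps = {
    "exons": {"all": 112000, "methylated": 50915, "sparse": 4968, "unmethylated": 56117},
    "genes": {"all": 473521, "methylated": 207303, "sparse": None, "unmethylated": 225844},
}


def implied_sparse(c):
    """Sparse count implied by the partition: all - methylated - unmethylated."""
    return c["all"] - c["methylated"] - c["unmethylated"]


for feature, c in overlaps.items():
    implied = implied_sparse(c)
    if c["sparse"] is None:
        print(f"{feature}: implied sparse overlap count = {implied}")
    else:
        print(f"{feature}: reported {c['sparse']}, implied {implied}")
```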

## 4. Generate methylation islands

To identify methylation islands using the method from Jeong et al. (2018), I need to define:

- starting size of the methylation window: 500 bp
- minimum fraction of methylated CpGs required within the window to be accepted: 0.02
- step size to extend the accepted window as long as the mCpG fraction is met: 50 bp
- mCpG file: input with mCpG chromosome and bp position
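
To make the parameters concrete, here is a simplified Python sketch of a sliding-window island search. It is not the Jeong et al. (2018) Perl script, and it assumes the mCpG fraction means mCpG count divided by window length in bp; `find_islands` and its argument names are hypothetical.

```python
from bisect import bisect_left


def find_islands(positions, window=500, min_fraction=0.02, step=50):
    """Return (start, end, mCpG count) tuples for one scaffold.

    positions: sorted mCpG coordinates. A window opens at an mCpG, is
    accepted if count / length >= min_fraction, and then grows by `step`
    while that density still holds.
    """
    islands = []
    i = 0
    while i < len(positions):
        start = positions[i]
        end = start + window
        j = bisect_left(positions, end)  # mCpGs in [start, end) are positions[i:j]
        if (j - i) / window < min_fraction:
            i += 1  # window too sparse; try the next mCpG as a start
            continue
        while True:
            new_end = end + step
            new_j = bisect_left(positions, new_end)
            if (new_j - i) / (new_end - start) < min_fraction:
                break  # extending would drop below the required density
            end, j = new_end, new_j
        # Report the island from its first to its last mCpG
        islands.append((start, positions[j - 1], j - i))
        i = j
    return islands


# 13 mCpGs spaced 40 bp apart form one island; isolated sites form none
dense = list(range(1000, 1500, 40))
print(find_islands(dense))            # [(1000, 1480, 13)]
print(find_islands([0, 1000, 2000]))  # []
```

With a 500 bp window and a 0.02 fraction, a window needs at least 10 mCpGs to be accepted under this assumption.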

### 4a. Create mCpG file

In [50]:
#Modify mCpG file by removing the third column that is not needed for methylation island analysis
!awk '{print $1"\t"$2}' ../Output/Pg_val_1_bismark_bt2_pe._10x-Methylated.bedgraph.bed \
> ../Output/Pg_val_1_bismark_bt2_pe._10x-Methylated-Reduced.bed

In [51]:
#Confirm file only has chromosome and start bp for mCpG
!head ../Output/Pg_val_1_bismark_bt2_pe._10x-Methylated-Reduced.bed

Scaffold_01	11797
Scaffold_01	11838
Scaffold_01	11843
Scaffold_01	11846
Scaffold_01	11851
Scaffold_01	12029
Scaffold_01	51414
Scaffold_01	51426
Scaffold_01	51470
Scaffold_01	51563


### 4b. Identify methylation islands

In [None]:
#Make perl script executable
!chmod +x methyl_island_sliding_window.pl

In [63]:
#Identify methylation islands using 0.02 mCpG fraction
! ./methyl_island_sliding_window.pl 500 0.02 50 ../Output/Pg_val_1_bismark_bt2_pe._10x-Methylated-Reduced.bed \
> ../Output/Pg-Methylation-Islands-500_0.02_50.tab

In [64]:
#Confirm script worked
!head ../Output/Pg-Methylation-Islands-500_0.02_50.tab

Scaffold_01	70728	71285	12
Scaffold_01	72061	72653	14
Scaffold_01	74715	75504	18
Scaffold_01	77141	78219	22
Scaffold_01	80701	81072	11
Scaffold_01	81352	83745	48
Scaffold_01	84678	85303	14
Scaffold_01	86275	90755	92
Scaffold_01	92491	93893	37
Scaffold_01	142681	144251	38


In [67]:
#Filter by MI length and print MI length in a new column
!awk '{if ($3-$2 >= 500) { print $1"\t"$2"\t"$3"\t"$4"\t"$3-$2}}' ../Output/Pg-Methylation-Islands-500_0.02_50.tab \
> ../Output/Pg-Methylation-Islands-500_0.02_50-filtered.tab

In [69]:
#Check output
!head ../Output/Pg-Methylation-Islands-500_0.02_50-filtered.tab
!wc -l ../Output/Pg-Methylation-Islands-500_0.02_50-filtered.tab

Scaffold_01	70728	71285	12	557
Scaffold_01	72061	72653	14	592
Scaffold_01	74715	75504	18	789
Scaffold_01	77141	78219	22	1078
Scaffold_01	81352	83745	48	2393
Scaffold_01	84678	85303	14	625
Scaffold_01	86275	90755	92	4480
Scaffold_01	92491	93893	37	1402
Scaffold_01	142681	144251	38	1570
Scaffold_01	175654	183961	173	8307
    5489 ../Output/Pg-Methylation-Islands-500_0.02_50-filtered.tab
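
The filtered rows can be checked against the parameters: every island should be at least 500 bp long, and if the 0.02 fraction is read as mCpGs per bp (an assumption about the script, not something stated in this notebook), every island should meet it. The sketch below uses the ten rows printed above.

```python
# (scaffold, start, end, mCpG count) copied from the filtered head output
islands = [
    ("Scaffold_01", 70728, 71285, 12),
    ("Scaffold_01", 72061, 72653, 14),
    ("Scaffold_01", 74715, 75504, 18),
    ("Scaffold_01", 77141, 78219, 22),
    ("Scaffold_01", 81352, 83745, 48),
    ("Scaffold_01", 84678, 85303, 14),
    ("Scaffold_01", 86275, 90755, 92),
    ("Scaffold_01", 92491, 93893, 37),
    ("Scaffold_01", 142681, 144251, 38),
    ("Scaffold_01", 175654, 183961, 173),
]

densities = []
for scaffold, start, end, mcpg in islands:
    length = end - start  # same length definition as the awk filter ($3-$2)
    density = mcpg / length
    densities.append(density)
    print(f"{scaffold}:{start}-{end}\t{length} bp\t{density:.4f} mCpG/bp")
```

All ten rows sit at or just above 0.02 mCpG per bp, which is consistent with that reading of the threshold.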


In [70]:
#Find the maximum number of mCpGs in an island
#Find the minimum number of mCpGs in an island
!awk 'NR==1{max = $4 + 0; next} {if ($4 > max) max = $4;} END {print max}' \
../Output/Pg-Methylation-Islands-500_0.02_50-filtered.tab
!awk 'NR==1{min = $4 + 0; next} {if ($4 < min) min = $4;} END {print min}' \
../Output/Pg-Methylation-Islands-500_0.02_50-filtered.tab

645
11


### 4c. Create BED files for `bedtools` and IGV

In [71]:
#Create tab-delimited BED file without additional information
!awk '{print $1"\t"$2"\t"$3}' ../Output/Pg-Methylation-Islands-500_0.02_50-filtered.tab \
> ../Output/Pg-Methylation-Islands-500_0.02_50-filtered.bed

In [72]:
!head ../Output/Pg-Methylation-Islands-500_0.02_50-filtered.bed

Scaffold_01	70728	71285
Scaffold_01	72061	72653
Scaffold_01	74715	75504
Scaffold_01	77141	78219
Scaffold_01	81352	83745
Scaffold_01	84678	85303
Scaffold_01	86275	90755
Scaffold_01	92491	93893
Scaffold_01	142681	144251
Scaffold_01	175654	183961


In [116]:
methylationIslands = "../Output/Pg-Methylation-Islands-500_0.02_50-filtered.bed"

### 4d. Identify genome feature overlaps with methylation islands

#### Genes

In [117]:
!{bedtoolsDirectory}intersectBed \
-wo \
-a {methylationIslands} \
-b ../Data/Genome/Panopea-generosa-v1.0.a4.gene.gff3 \
| wc -l
! echo "methylation island overlaps with genes"

    3963
methylation island overlaps with genes


In [106]:
!{bedtoolsDirectory}intersectBed \
-wo \
-a {methylationIslands} \
-b ../Data/Genome/Panopea-generosa-v1.0.a4.gene.gff3 \
> ../Output/Pg-Methylation-Islands-Gene-Overlap.txt

In [107]:
!head ../Output/Pg-Methylation-Islands-Gene-Overlap.txt

Scaffold_01	70728	71285	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	70713	81099	.	+	.	ID=PGEN_.00g000060;Name=PGEN_.00g000060;original_ID=21510-PGEN_.00g234190;Alias=21510-PGEN_.00g234190;original_name=21510-PGEN_.00g234190;Notes=sp|Q61043|NIN_MOUSE [DIAMOND Functional 0.9.22],PF04443.7 [Pfam 1.6]	557
Scaffold_01	72061	72653	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	70713	81099	.	+	.	ID=PGEN_.00g000060;Name=PGEN_.00g000060;original_ID=21510-PGEN_.00g234190;Alias=21510-PGEN_.00g234190;original_name=21510-PGEN_.00g234190;Notes=sp|Q61043|NIN_MOUSE [DIAMOND Functional 0.9.22],PF04443.7 [Pfam 1.6]	592
Scaffold_01	74715	75504	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	70713	81099	.	+	.	ID=PGEN_.00g000060;Name=PGEN_.00g000060;original_ID=21510-PGEN_.00g234190;Alias=21510-PGEN_.00g234190;original_name=21510-PGEN_.00g234190;Notes=sp|Q61043|NIN_MOUSE [DIAMOND Functional 0.9.22],PF04443.7 [Pfam 1.6]	789
Scaffold_01	77141	78219	Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	70713	81099	

#### Intergenic

In [118]:
!{bedtoolsDirectory}intersectBed \
-v \
-a {methylationIslands} \
-b ../Data/Genome/Panopea-generosa-v1.0.a4.gene.gff3 \
| wc -l
!echo "methylation island overlaps with intergenic regions"

    1568
methylation island overlaps with intergenic regions
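
The gene and intergenic counts do not sum to the 5489 filtered islands because `-wo` writes one line per island-gene pair, while `-v` writes each island with no gene overlap once. A short sketch of the arithmetic, using only counts reported in this notebook (variable names are illustrative):

```python
total_islands = 5489         # wc -l of the filtered island file
gene_overlap_rows = 3963     # intersectBed -wo against genes
islands_without_gene = 1568  # intersectBed -v against genes

# Every island either overlaps at least one gene or is reported by -v
islands_with_gene = total_islands - islands_without_gene

# Rows beyond one per island come from islands that overlap more than one gene
extra_rows = gene_overlap_rows - islands_with_gene

print(f"islands overlapping >= 1 gene: {islands_with_gene}")
print(f"additional island-gene pairs:  {extra_rows}")
print(f"fraction of islands in genes:  {islands_with_gene / total_islands:.3f}")
```

So 3921 islands overlap at least one gene, and the remaining 42 rows in the `-wo` output are additional island-gene pairs.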


In [119]:
!{bedtoolsDirectory}intersectBed \
-v \
-a {methylationIslands} \
-b ../Data/Genome/Panopea-generosa-v1.0.a4.gene.gff3 \
> ../Output/Pg-Methylation-Islands-Intergenic-Overlap.txt

In [114]:
!head ../Output/Pg-Methylation-Islands-Intergenic-Overlap.txt

Scaffold_01	81352	83745
Scaffold_01	84678	85303
Scaffold_01	86275	90755
Scaffold_01	92491	93893
Scaffold_01	142681	144251
Scaffold_01	214446	215376
Scaffold_01	215831	221463
Scaffold_01	223419	223943
Scaffold_01	241343	242796
Scaffold_01	288347	289009
