# overlap of methylated CpG with gene features

Would like to know how many CpG dinucleotides overlap with genomic features, as well as methylated CpGs

the genomic feature files below were generated in [generate_genomic_feature_tracks.ipynb](https://github.com/jgmcdonough/CE18_methylRAD_analysis/blob/master/analysis/genomic_feature_tracks/generate_genomic_feature_tracks.ipynb)

using the [BEDtools suite](https://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html)


In [20]:
# gene feature BED files
exonList="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_exon_sorted.bed"
intronList="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_intron.bed"
exonUTR="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_exonUTR.bed"
promoterList="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/mRNA_promoter_track.bed"
intergenicList="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_intergenic.bed"
teList="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/genomic_feature_tracks/Venkataraman_files/C_virginica-3.0_TE-all.gff"
geneList="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_gene_sorted.bed"
noncodingList="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_noncoding.bed"
cdsList="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_CDS_sorted.bed"

# CpG dinucleotide list
cpgList="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/C_virginica-3.0_CG-motif.bed"


**all CpG dinucleotides** 

In [3]:
!bedtools intersect -u -a {cpgList} -b {geneList} | wc -l
!echo "total CpG dinucleotides in genes"

7774010
total CpG dinucleotides in genes


In [4]:
!bedtools intersect -u -a {cpgList} -b {exonList} | wc -l
!echo "total CpG dinucleotides in exons"

2323389
total CpG dinucleotides in exons


In [5]:
!bedtools intersect -u -a {cpgList} -b {intronList} | wc -l
!echo "total CpG dinucleotides in introns"

5497874
total CpG dinucleotides in introns


In [6]:
!bedtools intersect -u -a {cpgList} -b {exonUTR} | wc -l
!echo "total CpG dinucleotides in UTRs"

600840
total CpG dinucleotides in UTRs


In [7]:
!bedtools intersect -u -a {cpgList} -b {promoterList} | wc -l
!echo "total CpG dinucleotides in putative promoters"

926518
total CpG dinucleotides in putative promoters


In [8]:
!bedtools intersect -u -a {cpgList} -b {intergenicList} | wc -l
!echo "total CpG dinucleotides in intergenic regions"

6644297
total CpG dinucleotides in intergenic regions


In [9]:
!bedtools intersect -u -a {cpgList} -b {noncodingList} | wc -l
!echo "total CpG dinucleotides in non-coding regions"

12142171
total CpG dinucleotides in non-coding regions


In [10]:
!bedtools intersect -u -a {cpgList} -b {teList} | wc -l
!echo "total CpG dinucleotides in transposable elements"

2828372
total CpG dinucleotides in transposable elements


In [11]:
!bedtools intersect -u -a {cpgList} -b {cdsList} | wc -l
!echo "total CpG dinucleotides in coding sequences"

1722555
total CpG dinucleotides in coding sequences


#### methylated CpG by treatment

first, filtering treatment CpG motifs - a CpG dinucleotide is considered methylated if the average count across the four replicates is greater than or equal to 4. This ensures that we're not counting a CpG methylated if one sample has a count of 15 and the rest have 0.

only need to run this code once - then can just read in files below

In [80]:
import pandas as pd

# Load the CSV file
CC_multicov = pd.read_csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/CC_CpGmulticov.csv')

# Calculate average counts of last 5 columns
CC_multicov['avg_counts'] = CC_multicov.iloc[:, -5:].mean(axis=1)

# Remove rows with averages below 5
CC_filtered = CC_multicov[CC_multicov['avg_counts'] >= 5]

# Remove temporary average column
CC_filtered = CC_filtered.drop('avg_counts', axis=1)

# Select the desired columns
CC_bed = CC_filtered[['chromosome', 'start', 'stop']]

# Save to bed file
CC_bed.to_csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/CC_cpgMethyl.bed', 
               sep='\t', 
               header=False, 
               index=False)

In [83]:
# Load the CSV file
HH_multicov = pd.read_csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/HH_CpGmulticov.csv')

# Calculate average counts of last 5 columns
HH_multicov['avg_counts'] = HH_multicov.iloc[:, -5:].mean(axis=1)

# Remove rows with averages below 5
HH_filtered = HH_multicov[HH_multicov['avg_counts'] >= 5]

# Remove temporary average column
HH_filtered = HH_filtered.drop('avg_counts', axis=1)

# Select the desired columns
HH_bed = HH_filtered[['chromosome', 'start', 'stop']]

# Save to bed file
HH_bed.to_csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/HH_cpgMethyl.bed', 
               sep='\t', 
               header=False, 
               index=False)

In [83]:
# Load the CSV file
HH_multicov = pd.read_csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/HH_CpGmulticov.csv')

# Calculate average counts of last 5 columns
HH_multicov['avg_counts'] = HH_multicov.iloc[:, -5:].mean(axis=1)

# Remove rows with averages below 5
HH_filtered = HH_multicov[HH_multicov['avg_counts'] >= 5]

# Remove temporary average column
HH_filtered = HH_filtered.drop('avg_counts', axis=1)

# Select the desired columns
HH_bed = HH_filtered[['chromosome', 'start', 'stop']]

# Save to bed file
HH_bed.to_csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/HH_cpgMethyl.bed', 
               sep='\t', 
               header=False, 
               index=False)

In [83]:
# Load the CSV file
HH_multicov = pd.read_csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/HH_CpGmulticov.csv')

# Calculate average counts of last 5 columns
HH_multicov['avg_counts'] = HH_multicov.iloc[:, -5:].mean(axis=1)

# Remove rows with averages below 5
HH_filtered = HH_multicov[HH_multicov['avg_counts'] >= 5]

# Remove temporary average column
HH_filtered = HH_filtered.drop('avg_counts', axis=1)

# Select the desired columns
HH_bed = HH_filtered[['chromosome', 'start', 'stop']]

# Save to bed file
HH_bed.to_csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/HH_cpgMethyl.bed', 
               sep='\t', 
               header=False, 
               index=False)

In [3]:
# CpG lists for each treatment
CC_CpG = "/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/CC_cpgMethyl.bed"
CH_CpG = "/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/CH_cpgMethyl.bed"
HC_CpG = "/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/HC_cpgMethyl.bed"
HH_CpG = "/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/HH_cpgMethyl.bed"

**methylated CpG in genes**

In [91]:
!bedtools intersect -u -a {CC_CpG} -b {geneList} | wc -l
!echo "methylated CpG for CC overlaps with genes"

!bedtools intersect -u -a {CH_CpG} -b {geneList} | wc -l
!echo "methylated CpG for CH overlaps with genes"

!bedtools intersect -u -a {HC_CpG} -b {geneList} | wc -l
!echo "methylated CpG for HC overlaps with genes"

!bedtools intersect -u -a {HH_CpG} -b {geneList} | wc -l
!echo "methylated CpG for HH overlaps with genes"

86816
methylated CpG for CC overlaps with genes
37891
methylated CpG for CH overlaps with genes
110613
methylated CpG for HC overlaps with genes
101771
methylated CpG for HH overlaps with genes


**methylated CpG in exons**

In [92]:
!bedtools intersect -u -a {CC_CpG} -b {exonList} | wc -l
!echo "methylated CpG for CC overlaps with exons"

!bedtools intersect -u -a {CH_CpG} -b {exonList} | wc -l
!echo "methylated CpG for CH overlaps with exons"

!bedtools intersect -u -a {HC_CpG} -b {exonList} | wc -l
!echo "methylated CpG for HC overlaps with exons"

!bedtools intersect -u -a {HH_CpG} -b {exonList} | wc -l
!echo "methylated CpG for HH overlaps with exons"

57957
methylated CpG for CC overlaps with exons
26233
methylated CpG for CH overlaps with exons
68256
methylated CpG for HC overlaps with exons
64813
methylated CpG for HH overlaps with exons


**methylated CpG in introns**

In [93]:
!bedtools intersect -u -a {CC_CpG} -b {intronList} | wc -l
!echo "methylated CpG for CC overlaps with introns"

!bedtools intersect -u -a {CH_CpG} -b {intronList} | wc -l
!echo "methylated CpG for CH overlaps with introns"

!bedtools intersect -u -a {HC_CpG} -b {intronList} | wc -l
!echo "methylated CpG for HC overlaps with introns"

!bedtools intersect -u -a {HH_CpG} -b {intronList} | wc -l
!echo "methylated CpG for HH overlaps with introns"

29638
methylated CpG for CC overlaps with introns
12045
methylated CpG for CH overlaps with introns
43310
methylated CpG for HC overlaps with introns
37891
methylated CpG for HH overlaps with introns


**methylated CpG in putative promoters**

In [94]:
!bedtools intersect -u -a {CC_CpG} -b {promoterList} | wc -l
!echo "methylated CpG for CC overlaps with promoters"

!bedtools intersect -u -a {CH_CpG} -b {promoterList} | wc -l
!echo "methylated CpG for CH overlaps with promoters"

!bedtools intersect -u -a {HC_CpG} -b {promoterList} | wc -l
!echo "methylated CpG for HC overlaps with promoters"

!bedtools intersect -u -a {HH_CpG} -b {promoterList} | wc -l
!echo "methylated CpG for HH overlaps with promoters"

3529
methylated CpG for CC overlaps with promoters
1606
methylated CpG for CH overlaps with promoters
5650
methylated CpG for HC overlaps with promoters
4889
methylated CpG for HH overlaps with promoters


**methylated CpG in exon UTRs**

In [95]:
!bedtools intersect -u -a {CC_CpG} -b {exonUTR} | wc -l
!echo "methylated CpG for CC overlaps with exon UTRs"

!bedtools intersect -u -a {CH_CpG} -b {exonUTR} | wc -l
!echo "methylated CpG for CH overlaps with exon UTRs"

!bedtools intersect -u -a {HC_CpG} -b {exonUTR} | wc -l
!echo "methylated CpG for HC overlaps with exon UTRs"

!bedtools intersect -u -a {HH_CpG} -b {exonUTR} | wc -l
!echo "methylated CpG for HH overlaps with exon UTRs"

4165
methylated CpG for CC overlaps with exon UTRs
1545
methylated CpG for CH overlaps with exon UTRs
5164
methylated CpG for HC overlaps with exon UTRs
4843
methylated CpG for HH overlaps with exon UTRs


**methylated CpG in transposable elements**

In [96]:
!bedtools intersect -u -a {CC_CpG} -b {teList} | wc -l
!echo "methylated CpG for CC overlaps with TEs"

!bedtools intersect -u -a {CH_CpG} -b {teList} | wc -l
!echo "methylated CpG for CH overlaps with TEs"

!bedtools intersect -u -a {HC_CpG} -b {teList} | wc -l
!echo "methylated CpG for HC overlaps with TEs"

!bedtools intersect -u -a {HH_CpG} -b {teList} | wc -l
!echo "methylated CpG for HH overlaps with TEs"

16875
methylated CpG for CC overlaps with TEs
6346
methylated CpG for CH overlaps with TEs
24506
methylated CpG for HC overlaps with TEs
22736
methylated CpG for HH overlaps with TEs


**methylated CpG in intergenic regions**

In [97]:
!bedtools intersect -u -a {CC_CpG} -b {intergenicList} | wc -l
!echo "methylated CpG for CC overlaps with intergenic regions"

!bedtools intersect -u -a {CH_CpG} -b {intergenicList} | wc -l
!echo "methylated CpG for CH overlaps with intergenic regions"

!bedtools intersect -u -a {HC_CpG} -b {intergenicList} | wc -l
!echo "methylated CpG for HC overlaps with intergenic regions"

!bedtools intersect -u -a {HH_CpG} -b {intergenicList} | wc -l
!echo "methylated CpG for HH overlaps with intergenic regions"

20079
methylated CpG for CC overlaps with intergenic regions
6567
methylated CpG for CH overlaps with intergenic regions
32953
methylated CpG for HC overlaps with intergenic regions
28033
methylated CpG for HH overlaps with intergenic regions


## no overlaps with genomic features

CpGs that do not overlap any feature (aka unannotated intergenic regions)

In [99]:
# CpG motif
!bedtools intersect -v -a {cpgList} -b {exonList} {intronList} {teList} {promoterList} | wc -l
!echo "CpG motifs do not overlap with exons, introns, transposable elements (all), or putative promoters"

4576705
CpG motifs do not overlap with exons, introns, transposable elements (all), or putative promoters


4499027 CpG motifs do not overlap with exons, introns, transposable elements (all), or putative promoters

In [11]:
!bedtools intersect -v -a {CC_CpG} -b {exonList} {intronList} {teList} {promoterList} | wc -l
!echo "methylated CpG for cont cont do not overlap with exons, introns, transposable elements (all), intergenic regions or putative promoters"

!bedtools intersect -v -a {CH_CpG} -b {exonList} {intronList} {teList} {promoterList}  | wc -l
!echo "methylated CpG for cont hyp do not overlap with exons, introns, transposable elements (all), intergenic regions or putative promoters"

!bedtools intersect -v -a {HC_CpG} -b {exonList} {intronList} {teList} {promoterList}  | wc -l
!echo "methylated CpG for hyp cont do not overlap with exons, introns, transposable elements (all), intergenic regions or putative promoters"

!bedtools intersect -v -a {HH_CpG} -b {exonList} {intronList} {teList} {promoterList} | wc -l
!echo "methylated CpG for hyp hyp do not overlap with exons, introns, transposable elements (all), intergenic regions or putative promoters"

12610
methylated CpG for cont cont do not overlap with exons, introns, transposable elements (all), intergenic regions or putative promoters
3837
methylated CpG for cont hyp do not overlap with exons, introns, transposable elements (all), intergenic regions or putative promoters
20716
methylated CpG for hyp cont do not overlap with exons, introns, transposable elements (all), intergenic regions or putative promoters
16967
methylated CpG for hyp hyp do not overlap with exons, introns, transposable elements (all), intergenic regions or putative promoters


**how many methylated CpG overlap outside of genes? (not in genic regions)**

In [8]:
!bedtools intersect -v -a {CC_CpG} -b {geneList} | wc -l
!echo "methylated CpG for cont cont do not overlap with genes"

!bedtools intersect -v -a {CH_CpG} -b {geneList}  | wc -l
!echo "methylated CpG for cont hyp do not overlap with genes"

!bedtools intersect -v -a {HC_CpG} -b {geneList}  | wc -l
!echo "methylated CpG for hyp cont do not overlap with genes"

!bedtools intersect -v -a {HH_CpG} -b {geneList}  | wc -l
!echo "methylated CpG for hyp hyp do not overlap with genes"

20827
methylated CpG for cont cont do not overlap with genes
6934
methylated CpG for cont hyp do not overlap with genes
33868
methylated CpG for hyp cont do not overlap with genes
28935
methylated CpG for hyp hyp do not overlap with genes


## Proportion overlap

In [110]:
CC_CpG = "/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/CC_cpgMethyl.bed"
CH_CpG = "/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/CH_cpgMethyl.bed"
HC_CpG = "/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/HC_cpgMethyl.bed"
HH_CpG = "/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/HH_cpgMethyl.bed"

In [122]:
import pandas as pd

# Define the treatments and feature variables
treatments = {
    'CC': CC_CpG,
    'CH': CH_CpG,
    'HC': HC_CpG, 
    'HH': HH_CpG,
    'allCpG': cpgList
}

features = {
    'TE': teList,
    'exons': exonList,
    'introns': intronList,
    'putativePromoter': promoterList,
    'UTRs': exonUTR,
    'intergenic': intergenicList
}

# Initialize an empty list to store the results
results = []

# Loop over the treatments
for treatment_name, treatment_file in treatments.items():
    # Loop over the features
    for feature_name, feature_list in features.items():
        # Use bedtools intersect to count the overlaps
        overlap_count = !bedtools intersect -u -a {treatment_file} -b {feature_list} | wc -l
        
        # Append the result to the list
        results.append({'genomicFeature': feature_name, 'treatment': treatment_name, 'propOverlap': overlap_count})


# Convert the list to a pandas DataFrame
df = pd.DataFrame(results)

# Print the DataFrame
print(df)

      genomicFeature treatment propOverlap
0                 TE        CC     [16875]
1              exons        CC     [57957]
2            introns        CC     [29638]
3   putativePromoter        CC      [3529]
4               UTRs        CC      [4165]
5         intergenic        CC     [20079]
6                 TE        CH      [6346]
7              exons        CH     [26233]
8            introns        CH     [12045]
9   putativePromoter        CH      [1606]
10              UTRs        CH      [1545]
11        intergenic        CH      [6567]
12                TE        HC     [24506]
13             exons        HC     [68256]
14           introns        HC     [43310]
15  putativePromoter        HC      [5650]
16              UTRs        HC      [5164]
17        intergenic        HC     [32953]
18                TE        HH     [22736]
19             exons        HH     [64813]
20           introns        HH     [37891]
21  putativePromoter        HH      [4889]
22         

In [132]:
#df['propOverlap'] = df['propOverlap'].str.extract('(\d+)')

df.to_csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/propOverlap.csv', index=False)

can't seem to get the propOverlap column formatted correctly, so manipulated locally and then re-uploaded

In [147]:
import pandas as pd

# Load your data into a Pandas DataFrame
df = pd.read_csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/propOverlap.csv')

# Group the data by treatment and calculate the sum of propOverlap
treatment_sums = df.groupby('treatment')['propOverlap'].sum().reset_index()

# Merge the original DataFrame with the treatment sums
df_merged = pd.merge(df, treatment_sums, on='treatment', suffixes=('', '_sum'))

# Calculate the proportion of overlap
df_merged['proportion_overlap'] = df_merged['propOverlap'] / df_merged['propOverlap_sum']

# Print the result
print(df_merged)

df_merged.to_csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/propOverlap_decimal.csv', index=False)

      genomicFeature treatment  propOverlap  propOverlap_sum  \
0                 TE        CC        16875           132243   
1              exons        CC        57957           132243   
2            introns        CC        29638           132243   
3   putativePromoter        CC         3529           132243   
4               UTRs        CC         4165           132243   
5         intergenic        CC        20079           132243   
6                 TE        CH         6346            54342   
7              exons        CH        26233            54342   
8            introns        CH        12045            54342   
9   putativePromoter        CH         1606            54342   
10              UTRs        CH         1545            54342   
11        intergenic        CH         6567            54342   
12                TE        HC        24506           179839   
13             exons        HC        68256           179839   
14           introns        HC        43

## Proportion overlap with methylated CpGs

There are 5 individual replicate oysters per treatment combination. A CpG dinucleotide is considered methylated for that treatment if the majority of the replicates (at least 3 individuals out of the 5) have >= 5 sequences. 

In [6]:
import pandas as pd

# List of input CSV files
input_files = [
    '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/CC_CpGmulticov.csv',
    '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/CH_CpGmulticov.csv',
    '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/HC_CpGmulticov.csv',
    '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/HH_CpGmulticov.csv'
]

# List of output BED files
output_files = [
    '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/PO_CC_cpgMethyl.bed',
    '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/PO_CH_cpgMethyl.bed',
    '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/PO_HC_cpgMethyl.bed',
    '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/PO_HH_cpgMethyl.bed'
]

# Loop through input and output files
for input_file, output_file in zip(input_files, output_files):
    # Load the CSV file
    multicov_file = pd.read_csv(input_file)

    # Filter rows with at least 3 columns having 5 or more sequences
    filtered_multi = multicov_file[(multicov_file.iloc[:, -5:] >= 5).sum(axis=1) >= 3]

    # Select the desired columns
    bed_file = filtered_multi[['chromosome', 'start', 'stop']]

    # Save to bed file
    bed_file.to_csv(output_file, sep='\t', header=False, index=False)

In [17]:
CC_CpG = '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/PO_CC_cpgMethyl.bed'
CH_CpG = '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/PO_CH_cpgMethyl.bed'
HC_CpG = '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/PO_HC_cpgMethyl.bed'
HH_CpG = '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_methyl_assembly/assembly_pipeline_files/genomic_bed_files/CpG_multicov/PO_HH_cpgMethyl.bed'

**methylated CpG in genes**

In [10]:
!bedtools intersect -u -a {CC_CpG} -b {geneList} | wc -l
!echo "methylated CpG for CC overlaps with genes"

!bedtools intersect -u -a {CH_CpG} -b {geneList} | wc -l
!echo "methylated CpG for CH overlaps with genes"

!bedtools intersect -u -a {HC_CpG} -b {geneList} | wc -l
!echo "methylated CpG for HC overlaps with genes"

!bedtools intersect -u -a {HH_CpG} -b {geneList} | wc -l
!echo "methylated CpG for HH overlaps with genes"

61389
methylated CpG for CC overlaps with genes
68470
methylated CpG for CH overlaps with genes
91333
methylated CpG for HC overlaps with genes
86410
methylated CpG for HH overlaps with genes


**methylated CpG in exons**

In [11]:
!bedtools intersect -u -a {CC_CpG} -b {exonList} | wc -l
!echo "methylated CpG for CC overlaps with exons"

!bedtools intersect -u -a {CH_CpG} -b {exonList} | wc -l
!echo "methylated CpG for CH overlaps with exons"

!bedtools intersect -u -a {HC_CpG} -b {exonList} | wc -l
!echo "methylated CpG for HC overlaps with exons"

!bedtools intersect -u -a {HH_CpG} -b {exonList} | wc -l
!echo "methylated CpG for HH overlaps with exons"

42577
methylated CpG for CC overlaps with exons
46527
methylated CpG for CH overlaps with exons
57498
methylated CpG for HC overlaps with exons
56100
methylated CpG for HH overlaps with exons


**methylated CpG in introns**

In [12]:
!bedtools intersect -u -a {CC_CpG} -b {intronList} | wc -l
!echo "methylated CpG for CC overlaps with introns"

!bedtools intersect -u -a {CH_CpG} -b {intronList} | wc -l
!echo "methylated CpG for CH overlaps with introns"

!bedtools intersect -u -a {HC_CpG} -b {intronList} | wc -l
!echo "methylated CpG for HC overlaps with introns"

!bedtools intersect -u -a {HH_CpG} -b {intronList} | wc -l
!echo "methylated CpG for HH overlaps with introns"

19342
methylated CpG for CC overlaps with introns
22545
methylated CpG for CH overlaps with introns
34622
methylated CpG for HC overlaps with introns
31070
methylated CpG for HH overlaps with introns


**methylated CpG in putative promoters**

In [13]:
!bedtools intersect -u -a {CC_CpG} -b {promoterList} | wc -l
!echo "methylated CpG for CC overlaps with promoters"

!bedtools intersect -u -a {CH_CpG} -b {promoterList} | wc -l
!echo "methylated CpG for CH overlaps with promoters"

!bedtools intersect -u -a {HC_CpG} -b {promoterList} | wc -l
!echo "methylated CpG for HC overlaps with promoters"

!bedtools intersect -u -a {HH_CpG} -b {promoterList} | wc -l
!echo "methylated CpG for HH overlaps with promoters"

2081
methylated CpG for CC overlaps with promoters
2712
methylated CpG for CH overlaps with promoters
4177
methylated CpG for HC overlaps with promoters
3704
methylated CpG for HH overlaps with promoters


**methylated CpG in exon UTRs**

In [14]:
!bedtools intersect -u -a {CC_CpG} -b {exonUTR} | wc -l
!echo "methylated CpG for CC overlaps with exon UTRs"

!bedtools intersect -u -a {CH_CpG} -b {exonUTR} | wc -l
!echo "methylated CpG for CH overlaps with exon UTRs"

!bedtools intersect -u -a {HC_CpG} -b {exonUTR} | wc -l
!echo "methylated CpG for HC overlaps with exon UTRs"

!bedtools intersect -u -a {HH_CpG} -b {exonUTR} | wc -l
!echo "methylated CpG for HH overlaps with exon UTRs"

2964
methylated CpG for CC overlaps with exon UTRs
3250
methylated CpG for CH overlaps with exon UTRs
4207
methylated CpG for HC overlaps with exon UTRs
3986
methylated CpG for HH overlaps with exon UTRs


**methylated CpG in transposable elements**

In [96]:
!bedtools intersect -u -a {CC_CpG} -b {teList} | wc -l
!echo "methylated CpG for CC overlaps with TEs"

!bedtools intersect -u -a {CH_CpG} -b {teList} | wc -l
!echo "methylated CpG for CH overlaps with TEs"

!bedtools intersect -u -a {HC_CpG} -b {teList} | wc -l
!echo "methylated CpG for HC overlaps with TEs"

!bedtools intersect -u -a {HH_CpG} -b {teList} | wc -l
!echo "methylated CpG for HH overlaps with TEs"

16875
methylated CpG for CC overlaps with TEs
6346
methylated CpG for CH overlaps with TEs
24506
methylated CpG for HC overlaps with TEs
22736
methylated CpG for HH overlaps with TEs


**methylated CpG in intergenic regions**

In [15]:
!bedtools intersect -u -a {CC_CpG} -b {intergenicList} | wc -l
!echo "methylated CpG for CC overlaps with intergenic regions"

!bedtools intersect -u -a {CH_CpG} -b {intergenicList} | wc -l
!echo "methylated CpG for CH overlaps with intergenic regions"

!bedtools intersect -u -a {HC_CpG} -b {intergenicList} | wc -l
!echo "methylated CpG for HC overlaps with intergenic regions"

!bedtools intersect -u -a {HH_CpG} -b {intergenicList} | wc -l
!echo "methylated CpG for HH overlaps with intergenic regions"

11929
methylated CpG for CC overlaps with intergenic regions
15066
methylated CpG for CH overlaps with intergenic regions
24496
methylated CpG for HC overlaps with intergenic regions
21279
methylated CpG for HH overlaps with intergenic regions


taking the results above and putting into one df and exporting to CSV

In [35]:
# Define the treatments and feature variables
treatments = {
    'CC': CC_CpG,
    'CH': CH_CpG,
    'HC': HC_CpG, 
    'HH': HH_CpG,
    'allCpG': cpgList
}

features = {
    'TE': teList,
    'exons': exonList,
    'introns': intronList,
    'putativePromoter': promoterList,
    'UTRs': exonUTR,
    'intergenic': intergenicList
}

# Initialize an empty list to store the results
results = []

# Loop over the treatments
for treatment_name, treatment_file in treatments.items():
    # Loop over the features
    for feature_name, feature_list in features.items():
        # Use bedtools intersect to count the overlaps
        overlap_count = !bedtools intersect -u -a {treatment_file} -b {feature_list} | wc -l
        
        # Append the result to the list
        results.append({'genomicFeature': feature_name, 'treatment': treatment_name, 'propOverlap': overlap_count})


# Convert the list to a pandas DataFrame
df = pd.DataFrame(results)

# Print the DataFrame
df

Unnamed: 0,genomicFeature,treatment,propOverlap
0,TE,CC,[11289]
1,exons,CC,[42577]
2,introns,CC,[19342]
3,putativePromoter,CC,[2081]
4,UTRs,CC,[2964]
5,intergenic,CC,[11929]
6,TE,CH,[13304]
7,exons,CH,[46527]
8,introns,CH,[22545]
9,putativePromoter,CH,[2712]


In [36]:
# remove brackets around the numbers
df['propOverlap'] = df['propOverlap'].apply(lambda x: int(x[0]))

df

Unnamed: 0,genomicFeature,treatment,propOverlap
0,TE,CC,11289
1,exons,CC,42577
2,introns,CC,19342
3,putativePromoter,CC,2081
4,UTRs,CC,2964
5,intergenic,CC,11929
6,TE,CH,13304
7,exons,CH,46527
8,introns,CH,22545
9,putativePromoter,CH,2712


adding in the non-overlap info

In [37]:
import subprocess

# Create a dictionary with treatment names and their corresponding non-overlapping counts
non_overlapping_counts = {
    'CC': int(subprocess.check_output(f"bedtools intersect -v -a {CC_CpG} -b {exonList} {intronList} {teList} {promoterList} | wc -l", shell=True).decode().strip()),
    'CH': int(subprocess.check_output(f"bedtools intersect -v -a {CH_CpG} -b {exonList} {intronList} {teList} {promoterList} | wc -l", shell=True).decode().strip()),
    'HC': int(subprocess.check_output(f"bedtools intersect -v -a {HC_CpG} -b {exonList} {intronList} {teList} {promoterList} | wc -l", shell=True).decode().strip()),
    'HH': int(subprocess.check_output(f"bedtools intersect -v -a {HH_CpG} -b {exonList} {intronList} {teList} {promoterList} | wc -l", shell=True).decode().strip()),
    'allCpG': int(subprocess.check_output(f"bedtools intersect -v -a {cpgList} -b {exonList} {intronList} {teList} {promoterList} | wc -l", shell=True).decode().strip())
}

# Create a new DataFrame with the non-overlapping counts
non_overlapping_df = pd.DataFrame({
    'genomicFeature': ['no_overlap'] * len(non_overlapping_counts),
    'treatment': list(non_overlapping_counts.keys()),
    'propOverlap': list(non_overlapping_counts.values())
})

# Append the non_overlapping_df to the original DataFrame
df = pd.concat([df, non_overlapping_df], ignore_index=True)

df

Unnamed: 0,genomicFeature,treatment,propOverlap
0,TE,CC,11289
1,exons,CC,42577
2,introns,CC,19342
3,putativePromoter,CC,2081
4,UTRs,CC,2964
5,intergenic,CC,11929
6,TE,CH,13304
7,exons,CH,46527
8,introns,CH,22545
9,putativePromoter,CH,2712


In [40]:
# Group the data by treatment and calculate the sum of propOverlap
treatment_sums = df.groupby('treatment')['propOverlap'].sum().reset_index()

# Merge the original DataFrame with the treatment sums
df_merged = pd.merge(df, treatment_sums, on='treatment', suffixes=('', '_sum'))

# Calculate the proportion of overlap
df_merged['proportion_overlap'] = round(df_merged['propOverlap'] / df_merged['propOverlap_sum']*100, 2)

df_merged

Unnamed: 0,genomicFeature,treatment,propOverlap,propOverlap_sum,proportion_overlap
0,TE,CC,11289,97527,11.58
1,exons,CC,42577,97527,43.66
2,introns,CC,19342,97527,19.83
3,putativePromoter,CC,2081,97527,2.13
4,UTRs,CC,2964,97527,3.04
5,intergenic,CC,11929,97527,12.23
6,no_overlap,CC,7345,97527,7.53
7,TE,CH,13304,112541,11.82
8,exons,CH,46527,112541,41.34
9,introns,CH,22545,112541,20.03


In [41]:
#df['propOverlap'] = df['propOverlap'].str.extract('(\d+)')

df_merged.to_csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/majorityMe_propOverlap.csv', index=False)

## no overlaps with genomic features

CpGs that do not overlap any feature (aka unannotated intergenic regions)

In [28]:
# CpG motif
!bedtools intersect -v -a {cpgList} -b {exonList} {intronList} {teList} {promoterList} | wc -l
!echo "CpG motifs do not overlap with exons, introns, transposable elements (all), or putative promoters"

4576705
CpG motifs do not overlap with exons, introns, transposable elements (all), or putative promoters


In [29]:
!bedtools intersect -v -a {CC_CpG} -b {exonList} {intronList} {teList} {promoterList} | wc -l
!echo "methylated CpG for cont cont do not overlap with exons, introns, transposable elements (all), or putative promoters"

!bedtools intersect -u -a {CH_CpG} -b {exonList} {intronList} {teList} {promoterList} | wc -l
!echo "methylated CpG for cont hyp do not overlap with exons, introns, transposable elements (all), or putative promoters"

!bedtools intersect -u -a {HC_CpG} -b {exonList} {intronList} {teList} {promoterList} | wc -l
!echo "methylated CpG for hyp cont do not overlap with exons, introns, transposable elements (all), or putative promoters"

!bedtools intersect -u -a {HH_CpG} -b {exonList} {intronList} {teList} {promoterList} | wc -l
!echo "methylated CpG for hyp hyp do not overlap with exons, introns, transposable elements (all), or putative promoters"

7345
methylated CpG for cont cont do not overlap with exons, introns, transposable elements (all), intergenic regions or putative promoters
74977
methylated CpG for cont hyp do not overlap with exons, introns, transposable elements (all), intergenic regions or putative promoters
101314
methylated CpG for hyp cont do not overlap with exons, introns, transposable elements (all), intergenic regions or putative promoters
95457
methylated CpG for hyp hyp do not overlap with exons, introns, transposable elements (all), intergenic regions or putative promoters


In [34]:
import subprocess

# Create a dictionary with treatment names and their corresponding non-overlapping counts
non_overlapping_counts = {
    'CC': int(subprocess.check_output(f"bedtools intersect -v -a {CC_CpG} -b {exonList} {intronList} {teList} {promoterList} | wc -l", shell=True).decode().strip()),
    'CH': int(subprocess.check_output(f"bedtools intersect -v -a {CH_CpG} -b {exonList} {intronList} {teList} {promoterList} | wc -l", shell=True).decode().strip()),
    'HC': int(subprocess.check_output(f"bedtools intersect -v -a {HC_CpG} -b {exonList} {intronList} {teList} {promoterList} | wc -l", shell=True).decode().strip()),
    'HH': int(subprocess.check_output(f"bedtools intersect -v -a {HH_CpG} -b {exonList} {intronList} {teList} {promoterList} | wc -l", shell=True).decode().strip()),
    'allCpG': int(subprocess.check_output(f"bedtools intersect -v -a {cpgList} -b {exonList} {intronList} {teList} {promoterList} | wc -l", shell=True).decode().strip())
}

# Create a new DataFrame with the non-overlapping counts
non_overlapping_df = pd.DataFrame({
    'genomicFeature': ['no_overlap'] * len(non_overlapping_counts),
    'treatment': list(non_overlapping_counts.keys()),
    'propOverlap': list(non_overlapping_counts.values())
})

# Append the non_overlapping_df to the original DataFrame
df = pd.concat([df, non_overlapping_df], ignore_index=True)

df

Unnamed: 0,genomicFeature,treatment,propOverlap,non_overlapping_count
0,TE,CC,11289,7345.0
1,exons,CC,42577,7345.0
2,introns,CC,19342,7345.0
3,putativePromoter,CC,2081,7345.0
4,UTRs,CC,2964,7345.0
5,intergenic,CC,11929,7345.0
6,TE,CH,13304,74977.0
7,exons,CH,46527,74977.0
8,introns,CH,22545,74977.0
9,putativePromoter,CH,2712,74977.0
