<a href="https://colab.research.google.com/github/irinaachikhmina/Triplexes/blob/main/1_08_Data_mining_mm_regulatory_data_ChipAtlas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data mining and preprocessing: regulatory data download from ChIP-Atlas for the mouse genome

The function `ChipAtlasCart` takes the following parameters:

* path_in: the URL of the directory where the BED files are listed (default value is the ChIP-Atlas assembled data for mm10)
* path_out: the output file path where the resulting pickle file will be saved (default value is 'sparse_data.pickle')
* experiment_type: the experiment type to filter the files by ('ATC' - ATAC-Seq, 'His' - histone marks, 'Pol' - RNA polymerase, 'Oth' - transcription factors and others, 'DNS' - DNase-Seq, 'InP' - input control, 'Bsf' - bisulfite-Seq, 'NoD' - no description, 'Unc' - unclassified; default value is 'His')
* cell_class: the cell class to filter the files by (default value is 'ALL')
* threshold: the threshold to filter the files by (default value is '05')
* antigen: the antigen to filter the files by (default value is 'AllAg')
* cell_type: one or more cell types to filter the files by, passed as a tuple of names as in the Download section (default value is 'AllCell')
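
The filter parameters correspond to the dot-separated fields of an assembled BED file name. The sketch below builds such a name from the parameters. It is an illustration only: `encode_cell_type` and `build_file_name` are hypothetical helpers that are not part of the notebook, the field order is inferred from how the function reads the cell type from the fifth field, and the `'.'` to `PERIOD` replacement is an assumption about the intended encoding.

```python
def encode_cell_type(name):
    """Encode a readable cell type name using the replacements applied in ChipAtlasCart.

    Assumption: '.' is meant to map to 'PERIOD'.
    """
    replacements = [(" ", "_"), ("+", "PULUS"), ("(", "BRACKETL"),
                    (")", "BRACKETR"), (",", "KOMMA"), (".", "PERIOD")]
    for old, new in replacements:
        name = name.replace(old, new)
    return name


def build_file_name(experiment_type="His", cell_class="ALL", threshold="05",
                    antigen="AllAg", cell_type="AllCell"):
    """Join the filter fields in the assumed order and append the .bed extension."""
    parts = [experiment_type, cell_class, threshold, antigen,
             encode_cell_type(cell_type)]
    return ".".join(parts) + ".bed"


print(build_file_name("Pol", "Bld", cell_type="B cells"))
# Pol.Bld.05.AllAg.B_cells.bed
```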

The function lists the BED files available in the ChIP-Atlas directory and keeps those whose names match the given parameters. It reads each matching file line by line and extracts the peak data (chromosome, start and end positions, factor, cell type, and the value of the fifth BED column) together with the experiment ID, from which the list of unique experiments is built.
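
The per-line extraction can be sketched in isolation as follows. The helper name `parse_peak_line` and the sample line are hypothetical and exist only for illustration. The sample assumes the attribute column has the `ID=...;Name=...` layout that the function's parsing relies on.

```python
def parse_peak_line(line, cell_type):
    """Parse one tab-separated BED line (bytes) into a peak tuple and an experiment ID."""
    cols = line.strip().split(b"\t")
    attributes = cols[3]
    # The factor name is the text after ';Name=' up to the first '%'
    factor = attributes.split(b";Name=")[1].split(b"%")[0]
    # The experiment ID is the text after 'ID=' up to the first ';'
    experiment = attributes.split(b"ID=")[1].split(b";")[0]
    peak = (cols[0].decode(), cols[1].decode(), cols[2].decode(),
            factor.decode(), cell_type, cols[4].decode())
    return peak, experiment.decode()


# Hypothetical line, not taken from a real ChIP-Atlas file
sample = (b"chr1\t3000100\t3000400\t"
          b"ID=SRX000001;Name=H3K4me3%20(@%20B%20cells);Title=example\t512\n")
peak, experiment = parse_peak_line(sample, "B_cells")
print(peak)        # ('chr1', '3000100', '3000400', 'H3K4me3', 'B_cells', '512')
print(experiment)  # SRX000001
```

Note that the coordinates stay as strings, exactly as they are stored by the function.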

The peak data is saved as a pickle file at the specified output path. The numbers of peaks, unique experiments, and unique cell types are printed.
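
A minimal sketch of reading such a pickle file back is shown below. It writes a small hypothetical record to a temporary file first so that it runs without any download. The tuple layout follows the function, and calling the last field `score` is an assumption based on the standard BED column order.

```python
import os
import pickle
import tempfile

# Hypothetical record in the (chromosome, start, end, factor, cell_type, score) layout
records = [("chr1", "3000100", "3000400", "H3K4me3", "B_cells", "512")]

# Write the records the same way the function does
path = os.path.join(tempfile.mkdtemp(), "example.pickle")
with open(path, "wb") as handle:
    pickle.dump(records, handle)

# Load them back
with open(path, "rb") as handle:
    loaded = pickle.load(handle)

# Coordinates are stored as strings, so convert them before doing arithmetic
chrom, start, end, factor, cell_type, score = loaded[0]
peak_length = int(end) - int(start)
print(chrom, factor, peak_length)  # chr1 H3K4me3 300
```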

## Imports

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import urllib.request
from bs4 import BeautifulSoup

from joblib import Parallel, delayed
from tqdm import tqdm
import time

import pickle

## Defining functions

In [None]:
def ChipAtlasCart(path_in='https://dbarchive.biosciencedbc.jp/data/chip-atlas/data/mm10/assembled/', 
                  path_out='sparse_data.pickle', 
                  experiment_type='His', 
                  cell_class='ALL', 
                  threshold='05', 
                  antigen='AllAg', 
                  cell_type='AllCell'):
    
    # Prepare cell type data: accept a single name or an iterable of names
    # and encode special characters to match the file names
    cell_type_raw = (cell_type,) if isinstance(cell_type, str) else cell_type
    cell_type = ()
    for cell_raw in cell_type_raw:
      cell = cell_raw.replace(" ", "_")
      cell = cell.replace('+', 'PULUS')
      cell = cell.replace('(', 'BRACKETL')
      cell = cell.replace(')', 'BRACKETR')
      cell = cell.replace(',', 'KOMMA')
      cell = cell.replace('.', 'PERIOD')
      cell_type += (cell,)
    
    # Create empty list to store sparse data and experiments
    sparse_data = []
    experiments = []
    cell_types = []

    # Get list of .bed files' links
    response = urllib.request.urlopen(path_in)
    html = response.read()
    soup = BeautifulSoup(html, 'html.parser')
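    # href=True skips anchors without an href attribute, which would otherwise raise an error
    links = [link.get('href') for link in soup.find_all('a', href=True) if link.get('href').endswith('.bed')]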

    # Loop through each file to filter based on the defined parameters
    file_paths = []
    for line in tqdm(links, desc="Filtering files"):
      filename = line.split('.')
      if experiment_type not in filename:
        continue
      elif cell_class not in filename:
        continue
      elif threshold not in filename:
        continue
      elif antigen not in filename:
        continue
      # Skip the file if none of the requested cell types is a field of its name
      elif all(c_type not in filename for c_type in cell_type):
        continue
      else:
        file_path = path_in + line
        file_paths.append(file_path)
    
    # Loop through the list of file paths and process the data
    for file_path in tqdm(file_paths, desc="Processing files"):
        filename = file_path.split('/')[-1]
        # The fifth dot-separated field of the file name is the (encoded) cell type
        file_cell_type = filename.split('.')[4]
        with urllib.request.urlopen(file_path) as file:
          for line in file:
            # Skip the track header and empty lines
            if line.startswith(b'track') or not line.strip():
                continue
            cols = line.strip().split(b'\t')
            # The fourth column holds attributes such as 'ID=...;Name=...'
            factor = cols[3].split(b';Name=')[1].split(b'%')[0]
            experiment = cols[3].split(b'ID=')[1].split(b';')[0]
            sparse_data.append((cols[0].decode(), cols[1].decode(), cols[2].decode(), factor.decode(), file_cell_type, cols[4].decode()))
            experiments.append(experiment.decode())
            cell_types.append(file_cell_type)

    # Save the sparse data to a pickle file
    with open(path_out, 'wb') as file:
        pickle.dump(sparse_data, file)


    # Print the number of unique experiments and cell types
    unique_experiments = list(set(experiments))
    unique_cell_types = list(set(cell_types))
    print(f"Found {len(sparse_data)} peaks from {len(unique_experiments)} unique experiments and {len(unique_cell_types)} unique cell types")

    return None
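
The file filtering step can be tried offline on a hand-written list of names. The sketch below is a simplified restatement of the filter, not the function itself: `matches` is a hypothetical helper and the file names are made up for illustration.

```python
def matches(name, experiment_type, cell_class, threshold, antigen, cell_types):
    """Return True if every filter value is one of the dot-separated fields of name."""
    parts = name.split(".")
    return (experiment_type in parts
            and cell_class in parts
            and threshold in parts
            and antigen in parts
            and any(cell in parts for cell in cell_types))


# Hypothetical file names
names = [
    "Pol.Bld.05.AllAg.B_cells.bed",
    "Pol.Bld.05.AllAg.T_cells.bed",
    "His.Bld.05.AllAg.B_cells.bed",
    "Pol.Bld.10.AllAg.Germinal_center_B-cells.bed",
]
wanted_cells = ("B_cells", "Germinal_center_B-cells")
selected = [n for n in names if matches(n, "Pol", "Bld", "05", "AllAg", wanted_cells)]
print(selected)  # ['Pol.Bld.05.AllAg.B_cells.bed']
```

Because the check tests membership among the fields rather than field position, a value is matched wherever it occurs in the name.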

## Download

In [None]:
b_cells = ('B cells', 'Germinal center B-cells')

In [None]:
ChipAtlasCart(path_out='/content/drive/Triplexes/data/mm/pol.pickle',
              experiment_type='Pol',
              cell_class='Bld',
              threshold='05',
              antigen='AllAg',
              cell_type=b_cells)

In [None]:
ChipAtlasCart(path_out='/content/drive/Triplexes/data/mm/dns.pickle',
              experiment_type='DNS',
              cell_class='Bld',
              threshold='05',
              antigen='AllAg',
              cell_type=b_cells)

In [None]:
ChipAtlasCart(path_out='/content/drive/Triplexes/data/mm/tf.pickle',
              experiment_type='Oth',
              cell_class='Bld',
              threshold='05',
              antigen='AllAg',
              cell_type=b_cells)

In [None]:
ChipAtlasCart(path_out='/content/drive/Triplexes/data/mm/his.pickle',
              experiment_type='His',
              cell_class='Bld',
              threshold='05',
              antigen='AllAg',
              cell_type=b_cells)