# Creating Count Arrays

Before the analysis of translation limitation can begin, the data from the ribosome profiling experiments must be organized into count arrays. Count arrays are vectors that record the number of reads mapping to each base pair or codon position along a transcript. They are created in a Jupyter notebook running inside the Plastid Conda environment set up in (!!!). Using Plastid to build the count arrays makes it possible to apply important adjustments to the data, such as the P-site offsets determined in (!!!), and to subset the data to the coding regions of the transcripts. The count arrays are saved as simple CSV tables that can be easily incorporated into the analyses in later sections.
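To make the idea of a count array concrete, here is a minimal sketch using made-up numbers (not data from this experiment). It assumes a coding region whose length is a multiple of three and shows how a base-pair-resolution array can be collapsed to codon resolution.

```python
import numpy as np

# Toy base-pair-resolution count array for a 12-bp coding region (made-up values).
nt_counts = np.array([0, 3, 1, 0, 0, 5, 2, 2, 0, 0, 1, 0])

# Collapse to codon resolution by summing each consecutive triplet.
codon_counts = nt_counts.reshape(-1, 3).sum(axis=1)

print(codon_counts.tolist())  # [4, 5, 4, 1]
```

The total number of reads is unchanged by the collapse; only the resolution differs.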

### Step 17
Load the Python libraries and functions needed for this pipeline. These include several classes from Plastid and the contents of our setup_utils.py file.

In [1]:
# Import necessary packages
from plastid import BAMGenomeArray, GTF2_TranscriptAssembler, Transcript
import numpy as np
import pandas as pd
from plastid.plotting.plots import *
import setup_utils as st
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
%matplotlib inline

In [2]:
# Define the path to important files
gtf_path = "/home/keeganfl/Desktop/Work_Fall_2021/Protocol_test/genome/mouse/"
bam_path = "/home/keeganfl/Desktop/Work_Fall_2021/Protocol_test/seleno_seq/"
save_path = "/home/keeganfl/Desktop/Work_Fall_2021/Protocol_test/position_counts_seleno/"
p_site_path = "/home/keeganfl/Desktop/Work_Fall_2021/data_tables/p-site_offsets/villar/"
experimental = 'Trspfl'
samp_num = '2'

### Step 18
Load the tables of P-site offsets created in the section on determining P-site offsets, using the pandas function read_csv.

In [3]:
# Load in the table of P-site offsets. 
p_offsets_exp = pd.read_csv(
    p_site_path + experimental + "_RPF_" + samp_num + "_Aligned.toTranscriptome.out_p-site-offsets",
    sep="\t")

p_offsets_cont = pd.read_csv(
    p_site_path + "control" + "_RPF_" + samp_num + "_Aligned.toTranscriptome.out_p-site-offsets",
    sep="\t")
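As a rough illustration of what such a table might contain, the sketch below parses a small tab-separated table from a string and turns it into a lookup from read length to offset. The column names `length` and `p_offset` and all values are assumptions made for this example; the tables produced earlier in the protocol may use different names.

```python
import io
import pandas as pd

# Hypothetical table contents; real column names and values may differ.
example_table = "length\tp_offset\n28\t15\n29\t16\n30\t17\n"
p_offsets_example = pd.read_csv(io.StringIO(example_table), sep="\t")

# Build a lookup from read length to P-site offset.
offset_by_length = dict(zip(p_offsets_example["length"], p_offsets_example["p_offset"]))

print(offset_by_length[29])
```

Printing the columns of the real table (`p_offsets_exp.columns`) is a quick way to confirm that it loaded with the expected separator.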

### Step 19
Load a GTF genome annotation file into Python using Plastid's GTF2_TranscriptAssembler. It yields the transcripts as an iterator of Plastid Transcript objects, which we then convert to a list using Python's list function.


In [None]:
# Load the transcript annotations from the GTF file.
# GTF2_TranscriptAssembler returns an iterator, so here we convert it to a list.
# The with-block closes the file once all transcripts have been read.
with open(gtf_path + "mm10.refGene.gtf") as gtf_file:
    transcripts = list(GTF2_TranscriptAssembler(gtf_file, return_type=Transcript))

### Step 20
Load the BAM file containing the ribosome profiling data as a BAMGenomeArray, then map the reads to their corresponding P-sites by passing the custom VariableThreePrimeMapFactory function from setup_utils.py to the array's set_mapping method.

In [None]:
# Read the alignments from each BAM file and map every read to its P-site
alignments_exp = BAMGenomeArray(bam_path + "subset_" + experimental + "_RPF_" + samp_num + ".bam")
alignments_exp.set_mapping(st.VariableThreePrimeMapFactory(p_offsets=p_offsets_exp))

alignments_cont = BAMGenomeArray(bam_path + "subset_cont" + "_RPF_" + samp_num + ".bam")
alignments_cont.set_mapping(st.VariableThreePrimeMapFactory(p_offsets=p_offsets_cont))
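The idea behind a length-dependent three-prime mapping can be sketched in plain Python. This is an illustration of the concept only, not the implementation in setup_utils.py: the function name `p_site_position` is hypothetical, reads are assumed to be unspliced, and coordinates are assumed to be 0-based and half-open.

```python
def p_site_position(start, end, strand, offset_by_length):
    """Return the position assigned to a read, or None if its length has no offset.

    start and end are 0-based, half-open coordinates of an unspliced aligned read.
    The offset is counted inward from the 3' end of the read.
    """
    offset = offset_by_length.get(end - start)
    if offset is None:
        return None  # read lengths without a known offset are skipped
    if strand == "+":
        return end - 1 - offset  # 3' end is the last aligned base
    return start + offset        # on the minus strand the 3' end is the leftmost base


offsets = {30: 17}  # made-up offset for 30-nt reads
print(p_site_position(100, 130, "+", offsets))  # 112
print(p_site_position(100, 130, "-", offsets))  # 117
```

Each read therefore contributes a single count at one position, which is what allows the resulting arrays to be interpreted at codon resolution.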

### Step 21
For each transcript object in our list, use Plastid's get_counts method to create a NumPy array containing the number of counts at each position in the transcript.

In [None]:
# create a list to hold the vectors
count_vectors_exp = []
count_vectors_cont = []

# get counts for each transcript
for transcript in transcripts:
    count_vectors_exp.append(transcript.get_counts(alignments_exp))
    count_vectors_cont.append(transcript.get_counts(alignments_cont))

### Step 22
Once the count arrays have been created, the CDS coordinates stored in the Transcript objects can be used to trim the count arrays so that they cover only the CDS regions.

In [None]:
# Record the start and end of the coding region for each transcript.
cds_starts = []
cds_ends = []

for transcript in transcripts:
    cds_starts.append(transcript.cds_start)
    cds_ends.append(transcript.cds_end)

# Trim each count vector in place to its CDS region and convert it to a list.
for i in range(len(count_vectors_exp)):
    count_vectors_exp[i] = list(count_vectors_exp[i][cds_starts[i]:cds_ends[i]])
    count_vectors_cont[i] = list(count_vectors_cont[i][cds_starts[i]:cds_ends[i]])
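One detail worth checking at this step is how transcripts without an annotated coding region behave. Assuming, as Plastid's documentation describes, that cds_start and cds_end are None for such transcripts, slicing with `[None:None]` returns the full-length array rather than an empty one. The sketch below uses a hypothetical helper, `slice_cds`, and toy data to show one way to make that case explicit.

```python
import numpy as np

def slice_cds(counts, cds_start, cds_end):
    """Return the counts over the CDS as a list, or None if no CDS is annotated."""
    if cds_start is None or cds_end is None:
        return None
    return counts[cds_start:cds_end].tolist()


counts = np.arange(10)  # toy count array: 0, 1, ..., 9

print(slice_cds(counts, 2, 8))        # [2, 3, 4, 5, 6, 7]
print(slice_cds(counts, None, None))  # None
print(len(counts[None:None]))         # 10, the whole array is returned
```

Whether non-coding transcripts should be dropped or kept at full length is a choice for the analysis; the point is only that the plain slice does not remove them.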

### Step 23
Use the add_gene_ids function from setup_utils.py to prepend the transcript ID and gene ID of each transcript to its count vector.

In [None]:
st.add_gene_ids(transcripts, count_vectors_exp)
st.add_gene_ids(transcripts, count_vectors_cont)
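The resulting row layout can be pictured with a toy example. The helper `prepend_ids` and the IDs below are hypothetical and stand in for whatever add_gene_ids does internally; the only property taken from the text is that the two IDs come first and the counts follow.

```python
def prepend_ids(transcript_id, gene_id, counts):
    """Return a new row laid out as [transcript_id, gene_id, count_0, count_1, ...]."""
    return [transcript_id, gene_id] + list(counts)


row = prepend_ids("tx_example", "gene_example", [0, 3, 1])  # made-up IDs and counts

print(row)      # ['tx_example', 'gene_example', 0, 3, 1]
print(row[2:])  # [0, 3, 1]
```

Because the first two entries are identifiers, any later numeric operation on a row has to skip them with `row[2:]`.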

### Step 24
Filter out any count arrays that are too short or have too low a read density. In this example, a transcript is kept only if its count array is longer than 200 base pairs and its read density exceeds 0.15 reads per base pair in both the experimental and control samples.

In [None]:
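count_arrays_exp = []
count_arrays_cont = []
for array_e, array_c in zip(count_vectors_exp, count_vectors_cont):
    # The first two entries of each array are the transcript and gene IDs,
    # so only the entries from index 2 onward are counts.
    counts_e = array_e[2:]
    counts_c = array_c[2:]
    if (len(counts_e) > 200
            and sum(counts_e) / len(counts_e) > 0.15
            and sum(counts_c) / len(counts_c) > 0.15):
        count_arrays_exp.append(array_e)
        count_arrays_cont.append(array_c)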
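The filtering rule can also be written as a small, testable function. This is a sketch under stated assumptions: the name `passes_filter` and its parameters are hypothetical, the length is measured on the count entries only (excluding the two ID fields), and the default thresholds are simply the values used in the cell above.

```python
def passes_filter(row, min_length=200, min_density=0.15):
    """Return True if a row is long enough and dense enough to keep.

    row is laid out as [transcript_id, gene_id, count_0, count_1, ...].
    """
    counts = row[2:]
    if len(counts) <= min_length:
        return False
    return sum(counts) / len(counts) > min_density


dense_row = ["tx_a", "gene_a"] + [1] * 250          # long and dense
short_row = ["tx_b", "gene_b"] + [1] * 100          # too short
sparse_row = ["tx_c", "gene_c"] + [0] * 249 + [1]   # long but sparse

print(passes_filter(dense_row))   # True
print(passes_filter(short_row))   # False
print(passes_filter(sparse_row))  # False
```

To reproduce the paired filtering used here, a transcript would be kept only when both its experimental and control rows pass.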

### Step 25
Save the count arrays for use in future notebooks. Use the custom save_count_positions function from setup_utils.py so that the count arrays are saved with a header describing each column, which makes the files easier to read.

In [None]:
st.save_count_positions(count_arrays_exp, save_path + experimental + "_" + samp_num + '_counts.csv')
st.save_count_positions(count_arrays_cont, save_path + "control" + "_" + samp_num + '_counts.csv')
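Because transcripts differ in length, the saved table is ragged: each row has a different number of count columns. The sketch below shows one way such a file could be written and read back with Python's csv module. The functions `write_count_rows` and `read_count_rows`, the header names, and the file name are all hypothetical; the actual layout produced by save_count_positions may differ.

```python
import csv
import os
import tempfile

def write_count_rows(rows, path):
    """Write one row per transcript: transcript ID, gene ID, then per-position counts."""
    longest = max(len(row) for row in rows) - 2
    header = ["transcript_id", "gene_id"] + [f"pos_{i}" for i in range(longest)]
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)

def read_count_rows(path):
    """Read the rows back, skipping the header and converting counts to floats."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header line
        return [row[:2] + [float(x) for x in row[2:]] for row in reader]


rows = [["tx_a", "gene_a", 0, 3, 1], ["tx_b", "gene_b", 2, 2]]  # made-up rows
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "example_counts.csv")
    write_count_rows(rows, example_path)
    loaded = read_count_rows(example_path)

print(loaded)
```

Reading row by row avoids the padding that a rectangular table reader would otherwise add to the shorter transcripts.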