# Project Premature Stop Codons in *Arabidopsis thaliana*



To survive in an ecosystem, organisms need to adapt to the specific abiotic and biotic factors around them. Adaptation is especially important for plants because they cannot change their habitat during their lifespan. Adaptation can happen through changes at the cell, tissue and organism level, but the fundamental process that drives adaptation on the smallest scale is mutation. There are three categories of mutations, which have different effects on the transcription of DNA to RNA and the translation of RNA to protein. The first category is synonymous mutations. Synonymous mutations are point mutations in the DNA that change the sequence of base pairs without changing the encoded amino acid. **TODO ongoing introduction**

## Setup Project structure

## Ara11-Annotation

### Analysing positions of gene locations 

To determine the length or position of genes or RNAs, we need the following information for every gene in *A. thaliana*: the start of the 5'UTR, the start of the coding sequence, the stop of the coding sequence, the stop of the 3'UTR and the orientation of the gene on the DNA strand. For a detailed view of the different genes we record the start and stop of the protein as well as of the UTR regions. This leads to multiple entries per gene, one for each splice variant. Different splice variants have different protein start and stop positions and can also have different 5'UTR and 3'UTR regions, depending on the annotation in the Araport11 file. Genes that do not produce a protein, for example miRNAs, are stored in a separate table. For these only the start and end of the gene are available. We also generate a table for transposons, which contains their known start and stop positions.

In [1]:
%run ../scripts/Ara11_Annotation.py

In [2]:
ara11_genes

| | Name | Orientation | Start_5UTR | Stop_5UTR | Start_Protein | Stop_Protein | Start_3UTR | Stop_3UTR |
|---|---|---|---|---|---|---|---|---|
| 0 | AT1G01010.1 | + | 3631 | 3759 | 3760 | 5630 | 5631 | 5899 |
| 1 | AT1G01020.1 | - | 8667 | 9130 | 6915 | 8666 | 6788 | 6914 |
| 2 | AT1G01020.2 | - | 8667 | 8737 | 7315 | 8666 | 6788 | 7069 |
| 3 | AT1G01020.3 | - | 8443 | 8464 | 6915 | 8442 | 6788 | 6914 |
| 4 | AT1G01020.4 | - | 8594 | 9130 | 6915 | 8442 | 6788 | 6914 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48354 | ATMG01350.1 | + | | | 346757 | 347194 | | |
| 48355 | ATMG01360.1 | - | | | 349830 | 351413 | | |
| 48356 | ATMG01370.1 | - | | | 360717 | 361052 | | |
| 48357 | ATMG01400.1 | + | | | 363725 | 364042 | | |
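
As a rough illustration of how the coordinates in the table above can be used, the following sketch computes the genomic span of each region as `stop - start + 1`. It assumes that the coordinates are 1-based and inclusive and that the start is always the smaller coordinate regardless of orientation, as the rows shown above suggest. The helper `add_span_columns` and the `Span_*` column names are hypothetical and not part of `Ara11_Annotation.py`; the small DataFrame only re-types three rows of the output.

```python
import numpy as np
import pandas as pd

# Three rows re-typed from the table above; NaN marks a missing UTR annotation.
example = pd.DataFrame(
    {
        "Name": ["AT1G01010.1", "AT1G01020.1", "ATMG01350.1"],
        "Orientation": ["+", "-", "+"],
        "Start_5UTR": [3631, 8667, np.nan],
        "Stop_5UTR": [3759, 9130, np.nan],
        "Start_Protein": [3760, 6915, 346757],
        "Stop_Protein": [5630, 8666, 347194],
        "Start_3UTR": [5631, 6788, np.nan],
        "Stop_3UTR": [5899, 6914, np.nan],
    }
)


def add_span_columns(genes):
    """Return a copy with one Span_<region> column per annotated region."""
    out = genes.copy()
    for region in ("5UTR", "Protein", "3UTR"):
        # Assumes inclusive coordinates with start <= stop on both strands.
        out[f"Span_{region}"] = out[f"Stop_{region}"] - out[f"Start_{region}"] + 1
    return out


spans = add_span_columns(example)
print(spans[["Name", "Span_5UTR", "Span_Protein", "Span_3UTR"]])
```

Note that such a span is measured on the genome and therefore includes introns; it is not the length of the spliced transcript or of the protein.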


## Premature-Stop Attributes

### Approach 1: Gene Expression Differences

For the further analysis we need a functional dataset in which we are sure that the premature stop codons have an actual impact on the plant. Our idea is to compare gene expression between the wildtype accessions and the ko-mutant accessions (those containing the premature stop codon) and to keep all premature stop codons for which the two groups differ significantly. For this we use RNA-Seq data, which is also available from the 1001 Genomes Project and contains the transcriptomes of 727 *A. thaliana* accessions.

#### Preprocessing

To make the raw data accessible for analysis in Python, the VCF file is converted to a standard CSV file. This is done in the R script Preprocessing.r.

To simplify the further analysis, we convert the gt-section to a numerical format and filter both the gt-section and the expression dataset down to the accessions they share. The RNA-Seq dataset contains 727 accessions of *A. thaliana*; for 665 of them both RNA-Seq and genomic information are available.
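
A minimal sketch of these two steps is given below. It is not the code of `Preprocessing_Genexpression.py`: the genotype encoding (`0|0`, `1|1`, `./.`), the table orientation (accessions as rows in both tables) and all names such as `genotype_to_numeric`, `gt_raw` and `expression` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def genotype_to_numeric(gt):
    """Map a VCF GT string to 0 (reference), 1 (alternative allele) or NaN (missing)."""
    if not isinstance(gt, str):
        return np.nan
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return np.nan
    return float(any(allele != "0" for allele in alleles))


# Hypothetical toy data: rows are accessions, columns are premature stop codons.
gt_raw = pd.DataFrame(
    {"stop_A": ["0|0", "1|1", "./."], "stop_B": ["1|1", "0|0", "0|0"]},
    index=["acc_1", "acc_2", "acc_3"],
)
# Hypothetical expression table: rows are accessions, columns are genes.
expression = pd.DataFrame(
    {"AT1G01010": [5.0, 7.5], "AT1G01020": [0.0, 3.2]},
    index=["acc_2", "acc_3"],
)

# Convert every genotype string column by column.
gt_numeric = gt_raw.apply(lambda column: column.map(genotype_to_numeric))

# Keep only the accessions present in both tables, in the same order.
common = gt_numeric.index.intersection(expression.index)
gt_overlap = gt_numeric.loc[common]
expression_overlap = expression.loc[common]
print(gt_overlap)
```

Missing genotypes are kept as NaN here rather than being counted as wildtype, so that later steps can decide explicitly how to treat them.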

In [4]:
%run ../scripts/Preprocessing_Genexpression.py

#### High-Confidence Dataset

We try to categorize the premature stop codons into 3 classes: no significant change in gene expression, significant decrease in gene expression and significant increase in gene expression. If a premature stop codon is introduced in an ecotype of *A. thaliana*, we expect gene expression to be significantly reduced. We therefore calculate the number of accessions with and without the mutation (ko-mutant and WT) and the mean expression of each group.
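
The three classes can be sketched as follows. The text does not name the statistical test that is used, so this example swaps in a two-sided permutation test on the difference of group means purely as an assumption; the significance level `alpha=0.05`, the function names and the expression values are likewise hypothetical.

```python
import numpy as np


def permutation_p_value(wt, ko, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = np.random.default_rng(seed)
    wt = np.asarray(wt, dtype=float)
    ko = np.asarray(ko, dtype=float)
    observed = abs(wt.mean() - ko.mean())
    pooled = np.concatenate([wt, ko])
    hits = 0
    for _ in range(n_permutations):
        # Reassign the group labels at random and recompute the difference.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[: wt.size].mean() - shuffled[wt.size:].mean())
        hits += int(diff >= observed)
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


def classify_stop_codon(wt, ko, alpha=0.05):
    """Return 'decrease', 'increase' or 'not significant' for one stop codon."""
    if permutation_p_value(wt, ko) >= alpha:
        return "not significant"
    return "decrease" if np.mean(ko) < np.mean(wt) else "increase"


wt_expression = [10, 11, 9, 10, 12]  # hypothetical wildtype expression values
ko_expression = [1, 2, 1, 2, 1]      # hypothetical ko-mutant expression values
label = classify_stop_codon(wt_expression, ko_expression)
print(label)
```

When many premature stop codons are tested this way, the p-values would normally need a multiple-testing correction, which this sketch leaves out.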

Important facts (problems) about the analysis: The first problem we face is that different annotations exist for *A. thaliana*. Araport11 is the most recent annotation and identifies many new genes in regions previously believed to be non-coding. Unfortunately, the RNA-Seq data was mapped to the older annotation, TAIR10. Our approach is therefore to take the genomic premature stop codons annotated with Araport11 and keep only those that lie in genes included in our TAIR10-annotated RNA-Seq dataset. For each premature stop codon we count the wildtype and ko-mutant accessions and store the names of these accessions. We then calculate the mean and the standard deviation of expression for each group. Along the way we filter out all premature stop codons that have either no wildtype accession or no ko-mutant accession.
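
The bookkeeping described above could look roughly like the sketch below. It is not taken from `Geneexpression_Differences.py`: the table layouts (accessions as rows), the mapping `stop_to_gene`, the function `summarize_stop_codons`, the output column names and the placeholder gene `GENE_NOT_IN_RNASEQ` are all hypothetical.

```python
import pandas as pd

# Hypothetical genotypes: rows are accessions, columns are stop codons
# (1 = premature stop codon present, 0 = wildtype allele).
genotypes = pd.DataFrame(
    {"stop_A": [0, 0, 1, 1], "stop_B": [0, 0, 0, 0], "stop_C": [1, 0, 0, 0]},
    index=["acc_1", "acc_2", "acc_3", "acc_4"],
)
# Hypothetical expression: rows are accessions, columns are genes in the RNA-Seq data.
expression = pd.DataFrame(
    {"AT1G01010": [8.0, 10.0, 1.0, 3.0]},
    index=["acc_1", "acc_2", "acc_3", "acc_4"],
)
# Gene of each stop codon; stop_C lies in a gene missing from the RNA-Seq data.
stop_to_gene = {
    "stop_A": "AT1G01010",
    "stop_B": "AT1G01010",
    "stop_C": "GENE_NOT_IN_RNASEQ",
}


def summarize_stop_codons(genotypes, expression, stop_to_gene):
    """Per stop codon: group sizes, accession names, mean and standard deviation."""
    rows = []
    for stop, gene in stop_to_gene.items():
        if gene not in expression.columns:
            continue  # gene is unknown to the expression dataset
        ko = genotypes.index[genotypes[stop] == 1]
        wt = genotypes.index[genotypes[stop] == 0]
        if len(ko) == 0 or len(wt) == 0:
            continue  # no comparison possible without both groups
        rows.append(
            {
                "Stop": stop,
                "Gene": gene,
                "N_WT": len(wt),
                "N_KO": len(ko),
                "WT_Accessions": list(wt),
                "KO_Accessions": list(ko),
                "Mean_WT": expression.loc[wt, gene].mean(),
                "Mean_KO": expression.loc[ko, gene].mean(),
                # pandas uses the sample standard deviation (ddof=1) by default.
                "Std_WT": expression.loc[wt, gene].std(),
                "Std_KO": expression.loc[ko, gene].std(),
            }
        )
    return pd.DataFrame(rows)


summary = summarize_stop_codons(genotypes, expression, stop_to_gene)
print(summary[["Stop", "N_WT", "N_KO", "Mean_WT", "Mean_KO"]])
```

In this toy input only `stop_A` survives: `stop_B` has no ko-mutant accession and `stop_C` belongs to a gene without expression data.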

In [6]:
%run ../scripts/Geneexpression_Differences.py

Full List of Premature Stop Codons Ara11 contain:  29029
Filtered List of Premature Stop Codons after removing unknown genes of Ara11:  17046


KeyboardInterrupt: 

First we look at some common features of the premature stop codons in the 1001 Genomes project.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Numeric genotype matrix (accessions with RNA-Seq data) and table of premature stop codons
gt_section = pd.read_csv('data/preprocessed/GT_Section_Numeric_Overlap.csv', index_col=0)
stop_table = pd.read_csv('data/processed/Genexpression_Differences/Stop_Table_Full.csv', index_col=0)

# Number of premature stop codons annotated in each gene
genes = np.unique(stop_table['Gene'])
numbers_gene_count = stop_table['Gene'].value_counts().reindex(genes).to_numpy()

# Number of premature stop codons per row of the genotype matrix (one row per accession)
numbers_premature_stops = gt_section.sum(axis=1)
accessions = np.arange(len(gt_section))

fig = plt.figure(figsize=(8, 6))
gs = fig.add_gridspec(2, 1)

# Top panel: premature stop codons per accession
ax1 = fig.add_subplot(gs[0, 0])
ax1.scatter(accessions, numbers_premature_stops, alpha=0.7)
ax1.set_xlabel('Accessions of A. thaliana')
ax1.set_ylabel('Number of Premature Stop Codons')
ax1.set_title('Distribution of Premature Stop Codons in 1135 Accessions of A. thaliana')

# Bottom panel: histogram of premature stop codons per gene
ax2 = fig.add_subplot(gs[1, 0])
ax2.hist(numbers_gene_count, bins=60)
ax2.set_xlim(0, 60)
ax2.set_xlabel('Number of Premature Stop Codons in a gene')
ax2.set_ylabel('Number of Genes')

fig.tight_layout()
output_file = "results/figures/Distribution_Premature_Stop_Codons.png"
plt.savefig(output_file, dpi=700, facecolor='w')
plt.show()

## Control with Synonymous and Non-Synonymous Mutations