# Characterization of the human genome annotation Release 19 (GRCh37.p13)


## Preparation .gff3

1. Download [here](https://www.gencodegenes.org/human/release_19.html)

2. Keep 7 of the 9 columns from the .gff3

seqid - name of the chromosome

source - HAVANA or ENSEMBL

type - type of feature

start - start position of the feature (1-based)

end - end position of the feature

strand - defined as + (forward) or - (reverse).

attributes - contains the transcript ID, gene name and gene type

```bash
# GFF3 is tab-delimited; skip the "#" header lines and write space-separated columns
awk -F'\t' '!/^#/ {print $1, $2, $3, $4, $5, $7, $9}' gencode.v19.annotation.gff3 > annotation.txt
```

3. Types of features in column 3

```bash
cut -d ' ' -f 3 annotation.txt | sort | uniq

CDS
UTR
exon
gene
start_codon
stop_codon
stop_codon_redefined_as_selenocysteine
transcript
```

We keep only the **gene** features (**CDS** may be used later).

```bash
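# Keep gene rows only
awk '$3 == "gene" {print}' annotation.txt > annotation_genes_new.txt

# Extract gene_name and gene_type from the attributes into new fields
awk 'BEGIN {OFS="\t"} {match($7, /gene_name=([^;]+)/, gene_name); match($7, /gene_type=([^;]+)/, gene_type); $9 = gene_name[1]; $10 = gene_type[1]; print $0}' annotation_genes_new.txt > lolkek.txt

# Keep seqid, source, start, end, strand, gene_name, gene_type
awk '{print $1, $2, $4, $5, $6, $8, $9}' lolkek.txt > hg19_genes.txt

# Convert the space delimiters to tabs (\t in the replacement needs GNU sed)
sed 's/ /\t/g' hg19_genes.txt > hg19_genes_tab.txt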
```
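
The three-argument `match()` used above is a gawk extension and is not available in mawk or BSD awk. A minimal portable sketch of the same extraction, assuming the attributes sit in field 7 of a space-delimited row as in `annotation_genes_new.txt`. The sample row and the variable names `row` and `genes_tab` are only for illustration:

```bash
# Toy gene row in the 7-column, space-delimited layout (illustrative values).
row='chr1 HAVANA gene 11869 14412 + ID=ENSG00000223972.4;gene_type=pseudogene;gene_name=DDX11L1'

# Split the attributes on ";" and then on "=", keeping gene_name and gene_type.
genes_tab=$(printf '%s\n' "$row" | awk 'BEGIN { OFS = "\t" }
{
    name = ""; type = ""
    n = split($7, attrs, ";")
    for (i = 1; i <= n; i++) {
        split(attrs[i], kv, "=")
        if (kv[1] == "gene_name") name = kv[2]
        if (kv[1] == "gene_type") type = kv[2]
    }
    # Output order: seqid, source, start, end, strand, gene_name, gene_type
    print $1, $2, $4, $5, $6, name, type
}')

printf '%s\n' "$genes_tab"
```

This writes the seven tab-delimited columns in one pass, so the intermediate file and the `sed` step are not needed in this variant.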

4. Sources for annotation

```bash
cut -d ' ' -f 2 annotation.txt | sort | uniq

ENSEMBL
HAVANA
```

5. Statistics 

Total number of gene records: 57820

From ENSEMBL 9850 annotated genes

From HAVANA 47970 annotated genes
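
The per-source totals can be reproduced by counting values in the source column. A small sketch on invented rows, assuming the tab-delimited layout of `hg19_genes_tab.txt` with the source in column 2. The rows and gene names are hypothetical, not real annotation records:

```bash
# Hypothetical rows: seqid, source, start, end, strand, gene_name, gene_type.
toy=$(printf 'chr1\tHAVANA\t100\t200\t+\tGENE_A\tprotein_coding\nchr1\tENSEMBL\t300\t400\t-\tGENE_B\tmiRNA\nchr2\tHAVANA\t500\t600\t+\tGENE_C\tlincRNA\n')

# Count genes per annotation source (column 2) in a single pass.
source_counts=$(printf '%s\n' "$toy" \
    | awk -F'\t' '{ n[$2]++ } END { for (s in n) print s, n[s] }' \
    | sort)

printf '%s\n' "$source_counts"
```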

Gene types

```bash
cut -f 7 hg19_genes_tab.txt | sort | uniq -c

     21 3prime_overlapping_ncrna
     14 IG_C_gene
      9 IG_C_pseudogene
     37 IG_D_gene
     18 IG_J_gene
      3 IG_J_pseudogene
    138 IG_V_gene
    187 IG_V_pseudogene
      2 Mt_rRNA
     22 Mt_tRNA
      5 TR_C_gene
      3 TR_D_gene
     74 TR_J_gene
      4 TR_J_pseudogene
     97 TR_V_gene
     27 TR_V_pseudogene
   5276 antisense
   7114 lincRNA
   3055 miRNA
   2034 misc_RNA
     45 polymorphic_pseudogene
    515 processed_transcript
  20345 protein_coding
  13931 pseudogene
    527 rRNA
    742 sense_intronic
    202 sense_overlapping
   1916 snRNA
   1457 snoRNA
```

- **57820** total number of genes 

- **37** chrM genes (13 protein coding)

- **20345** protein coding **with chrM**

- **57783** genes without chrM 

- **20332** protein coding genes without chrM 
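
A sketch of how the chrM figures in this list can be derived, assuming a tab-delimited file with seqid in column 1 and gene_type in column 7. The toy rows below are invented, so the counts they give are not the real ones:

```bash
# Hypothetical rows: seqid, source, start, end, strand, gene_name, gene_type.
toy=$(printf 'chr1\tHAVANA\t100\t200\t+\tGENE_A\tprotein_coding\nchrM\tENSEMBL\t3307\t4262\t+\tGENE_M1\tprotein_coding\nchrM\tENSEMBL\t577\t647\t+\tGENE_M2\tMt_tRNA\n')

# All chrM genes.
chrM_total=$(printf '%s\n' "$toy" | awk -F'\t' '$1 == "chrM" { c++ } END { print c + 0 }')

# chrM genes that are protein coding.
chrM_coding=$(printf '%s\n' "$toy" | awk -F'\t' '$1 == "chrM" && $7 == "protein_coding" { c++ } END { print c + 0 }')

# Protein-coding genes that remain once chrM is dropped.
nuclear_coding=$(printf '%s\n' "$toy" | awk -F'\t' '$1 != "chrM" && $7 == "protein_coding" { c++ } END { print c + 0 }')

echo "chrM: $chrM_total (protein coding: $chrM_coding); protein coding without chrM: $nuclear_coding"
```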

To use `\t` as the delimiter throughout the file:

```bash
awk 'BEGIN {OFS="\t"} {print $1, $2, $3, $4, $5, $6, $7}' hg19_genes_0_based.bed > hg_19_genes_zero_based.txt
```

**For further analysis, we'll EXCLUDE mitochondrial genes (chr=='chrM')**

```bash
awk '$1 != "chrM"' 
```

## Make .bed files for bedtools

We need to subtract 1 from the gene start (column #3) in the 1-based file

```bash
awk '{ $3 = $3 - 1; print }'
```
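
GFF3 intervals are 1-based and closed, while BED intervals are 0-based and half-open, so only the start shifts and the end stays the same. A sketch on one illustrative row, assuming a tab-delimited file with the start in column 3. Setting `FS` and `OFS` together keeps the tabs when awk rebuilds the modified line:

```bash
# Illustrative row: seqid, source, start, end, strand, gene_name, gene_type.
row=$(printf 'chr1\tHAVANA\t11869\t14412\t+\tDDX11L1\tpseudogene')

# Shift the start to 0-based; the end is left untouched.
zero_based=$(printf '%s\n' "$row" | awk 'BEGIN { FS = OFS = "\t" } { $3 = $3 - 1; print }')

printf '%s\n' "$zero_based"
```

The interval length is preserved: 14412 - 11869 + 1 in 1-based closed coordinates equals 14412 - 11868 in 0-based half-open coordinates.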

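Then sort by chromosome and start position (the `-k2,2n` key assumes the start is in column 2, as in BED layout):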

```bash
sort -k1,1 -k2,2n 
```
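
A quick sketch of the sort step on made-up BED rows, assuming chr, start, end and name in columns 1 to 4. Running `sort -c` with the same keys exits with status 0 when the input is already in order, which gives a cheap sanity check before calling bedtools. Chromosome names are compared as text, so `chr10` sorts before `chr2`:

```bash
# Hypothetical unsorted BED rows: chr, start, end, name.
toy=$(printf 'chr2\t50\t80\tGENE_C\nchr1\t300\t400\tGENE_B\nchr1\t100\t200\tGENE_A\n')

# Sort by chromosome (text), then by start (numeric).
sorted=$(printf '%s\n' "$toy" | sort -k1,1 -k2,2n)
printf '%s\n' "$sorted"

# Verify the order with the same keys; sort -c prints nothing on success.
if printf '%s\n' "$sorted" | sort -c -k1,1 -k2,2n; then
    sort_ok=yes
else
    sort_ok=no
fi
echo "sorted: $sort_ok"
```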

Select the chr, start, stop and gene_name columns to get `hg19_genes_4fieldbed.bed`:

```bash
awk '{print $1, $3, $4, $5}' hg19_genes_formatted.bed > hg19_genes_4fieldbed.bed
```

Extend each gene by 500 bp on both sides: subtract 500 from the start and add 500 to the end values in `hg19_genes_4fieldbed.bed` (see this [link](https://www.sciencedirect.com/topics/medicine-and-dentistry/gene-promoter#:~:text=Promoter%20is%20a%20short%20DNA,of%20a%20potential%20mRNA%20molecule.) on gene promoters)

```bash
awk '{ $2 = $2 - 500; $3 = $3 + 500; print }' hg19_genes_4fieldbed.bed > hg19_genes_500.bed

awk 'BEGIN { FS = OFS = "\t" } { $2 = $2 - 500; $3 = $3 + 500; print }' annotation_genes_sorted.bed > hg19_genes_500_sorted.bed
```
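
Subtracting 500 from a start smaller than 500 gives a negative coordinate, which is not a valid BED start. A hedged variant of the flank step that clamps the start at 0, shown on two made-up intervals. The 500 bp width comes from the commands above, while the clamp is an addition and not part of the original pipeline:

```bash
# Hypothetical 4-field BED rows: chr, start, end, name.
toy=$(printf 'chr1\t120\t900\tGENE_A\nchr1\t11868\t14412\tGENE_B\n')

flanked=$(printf '%s\n' "$toy" | awk 'BEGIN { FS = OFS = "\t" }
{
    $2 = $2 - 500
    if ($2 < 0) $2 = 0    # clamp: BED starts cannot be negative
    $3 = $3 + 500
    print
}')

printf '%s\n' "$flanked"
```

This sketch does not cap the end at the chromosome length. `bedtools slop` with a genome file is an alternative that bounds both ends.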

All downstream work uses **hg19_genes_500.bed** (without chrM) and **hg19_genes_formatted.bed**.