# How to generate a GTF file from a .fa file?

Also see `gen_gtf_from_fa-like_txt.r`. Read the file as a data table and select the names of the gene sequences (i.e., the lines with `>` in them).

In [10]:
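# read the file without a header, so the column is named V1
data.table::fread('transgene.txt',header = F)->sq
# keep only the rows that contain '>', i.e. the sequence names
sq[grepl(t(sq),pattern = '>')] -> tgNames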
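
The same selection can be sketched in base R on a toy input, assuming that header lines begin with `>`. The temp file, its contents, and the `toy_*` names are made up for illustration. `readLines` is used here instead of `fread` so that each line stays a single string with no separator detection involved.

```r
# Toy FASTA-like input written to a temp file (hypothetical content).
toy_path <- tempfile(fileext = ".txt")
writeLines(c("> tgA", "ACGT", "ACG", "> tgB", "GGCC"), toy_path)

# Read the file line by line; each element is one line of text.
toy_lines <- readLines(toy_path)

# TRUE only for lines that begin with '>' (assumption: headers start with '>').
is_header <- startsWith(toy_lines, ">")
toy_headers <- toy_lines[is_header]
print(toy_headers)
```

Matching only at the start of the line is slightly stricter than searching for `>` anywhere in it, which matters only if a non-header line could contain `>`.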

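Find the sequence that corresponds to each name: concatenate all lines into one string, split it at each name, and drop the piece that precedes the first name. The length of each sequence is computed in the next step.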

In [12]:
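# collapse the whole file into one string
paste0(t(sq),collapse = '')->sq1
# split at every name; the names are joined into a regex alternation
strsplit(sq1,split = paste0(tgNames$V1,collapse = '|'))->seqs
# drop the piece that precedes the first name
tail(unlist(seqs), -1) -> seqs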
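
Because the names are used as a regular expression when splitting, a name containing regex metacharacters such as `.`, `|` or `(` could split in unexpected places. One alternative, sketched here on toy data, assigns every line to the header that precedes it and pastes each group together. The `toy_*` names and the input are hypothetical, and the sketch assumes every header is followed by at least one sequence line.

```r
# Toy lines as they would appear in a FASTA-like file (hypothetical content).
toy_lines <- c("> tgA", "ACGT", "ACG", "> tgB", "GGCC")

# Mark header lines, then number the records: each header starts a new record.
is_header <- startsWith(toy_lines, ">")
record_id <- cumsum(is_header)            # 1 1 1 2 2

# Group the non-header lines by record and join each group into one string.
# Note: a header with no sequence lines would simply be missing from the result.
toy_groups <- split(toy_lines[!is_header], record_id[!is_header])
toy_seqs <- vapply(toy_groups, paste0, character(1), collapse = "")

# Sequence lengths, one per record.
toy_lengths <- nchar(toy_seqs)
print(toy_lengths)
```

No regular expression is built from the names in this version, so their content does not affect the split.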

Strip the leading `>` from each name, then create the rest of the columns. The source is set to `custom_`, and the feature is `exon` because that's what STARsolo needs. We also set every score to `500`, but there's honestly little indication of what the score does (just following in the footsteps of hERVd). We assume the sequences are on the `+` strand and have no frame (we're not looking for proteins anyway...).

In [None]:
sub("^>\\s*", "", tgNames$V1)->seqname # strip the leading '>' and any spaces after it
rep(c("custom_"), each=length(seqname))->source
rep(c("exon"), each=length(seqname))->feature
rep(c(1), each=length(seqname))->start # every feature spans the whole sequence, starting at 1
nchar(seqs)->end # the end coordinate is the sequence length
rep(c("500"), each=length(seqname))->score
rep(c("+"), each=length(seqname))->strand # assumption: all sequences are on the + strand
rep(c("."), each=length(seqname))->frame

sapply(1:length(seqname), function(x) paste0("gene_id \"custom_", x, "\"", collapse=''))->gene_id
sapply(seqname, function(x) paste0("gene_name \"", x, "\"", collapse=''))->gene_name

mapply(function(x, y) paste0(x, "; ", y), gene_id, gene_name) -> attr
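
To make the target layout concrete, here is a small sketch of how one record ends up as a single tab-separated GTF line with nine fields. The values (`tgA`, length 7) and the `toy_*` names are made up for illustration; the attribute string follows the same `gene_id`/`gene_name` pattern as the code above.

```r
# Hypothetical record: a sequence named "tgA" that is 7 bases long.
toy_name <- "tgA"
toy_len <- 7L

# Attribute column: key "value" pairs separated by "; ".
toy_attr <- sprintf('gene_id "custom_%d"; gene_name "%s"', 1L, toy_name)

# The nine GTF fields in order:
# seqname, source, feature, start, end, score, strand, frame, attribute.
toy_fields <- c(toy_name, "custom_", "exon", 1, toy_len, "500", "+", ".", toy_attr)

# Join the fields with tabs to form one line of the file.
toy_line <- paste(toy_fields, collapse = "\t")
cat(toy_line, "\n")
```

Some GTF parsers expect every attribute, including the last one, to end with a semicolon, so it may be worth checking what the downstream tool requires.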

Combine the columns into one table and write it to a `.gtf` file: tab-separated, with no header, no row names, and no quoting.

In [None]:
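# data.table is not attached above (only data.table::fread is used), so qualify the call
gtf <- data.table::data.table(
  seqname=seqname,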
  source=source,
  feature=feature,
  start=start,
  end=end,
  score=score,
  strand=strand,
  frame=frame,
  attribute=attr
)

# don't forget to remove the row names
write.table(gtf, file='transgene.gtf', quote=FALSE, sep='\t', col.names = F, row.names = F)
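
A quick way to sanity-check the output is to read it back and confirm the shape. The sketch below does this on a toy table written to a temp file, so the table contents and the `toy_*` and `check` names are assumptions for illustration only; the same `read.delim` call could be pointed at `transgene.gtf`.

```r
# Toy table with the nine GTF columns (hypothetical values).
toy_gtf <- data.frame(
  seqname = c("tgA", "tgB"),
  source = "custom_",
  feature = "exon",
  start = 1L,
  end = c(7L, 4L),
  score = "500",
  strand = "+",
  frame = ".",
  attribute = c('gene_id "custom_1"; gene_name "tgA"',
                'gene_id "custom_2"; gene_name "tgB"')
)

# Write it the same way as above: tabs, no quoting, no header, no row names.
toy_out <- tempfile(fileext = ".gtf")
write.table(toy_gtf, file = toy_out, quote = FALSE, sep = "\t",
            col.names = FALSE, row.names = FALSE)

# Read it back as plain text columns; quote = "" keeps the double quotes
# inside the attribute column from being interpreted.
check <- read.delim(toy_out, header = FALSE, quote = "",
                    colClasses = "character")

# Nine columns, one row per sequence, and start never past end.
stopifnot(ncol(check) == 9, nrow(check) == nrow(toy_gtf))
stopifnot(all(as.integer(check$V4) <= as.integer(check$V5)))
```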