# Aim

In step four, I'd like to expand the phylogeny with more R2R3 MYB subfamilies to address some open questions:
* Can I add a subfamily label to the other Azfi sequences that I found earlier with a custom HMM?
* Can I infer a phylogeny that more confidently and explicitly places the Salviniales VII branch within the rest of the VII branch, by including other subfamilies and an outgroup?

In a later notebook, I'll address adding extra data. Next steps:

* Can I add intron/exon structures as a second line of evidence supporting the different subfamilies?
* Finally, I'd like to add log2 expression/fold change to the Azfi sequences, to illustrate how these subfamilies change expression upon transition to sexual reproduction, at least in _Azolla_.

Since retrieving sequences from the different databases is quite a pain,
I'll first retrieve only NCBI-listed sequences (which I can retrieve in batch).
Subsequently, I may reiterate in this notebook, expanding the analysis with sequences from other genome databases.

Genes in the new list files in the `data` directory were retrieved from NCBI.
Accession numbers from the J&R study were a mix of protein and gene IDs, so batch retrieval wasn't possible.
FASTA files with both accession numbers were created in the same directory.
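
One way to make mixed accession lists batch-friendly is to split them by accession type first. The sketch below is only an illustration under assumptions: it recognises RefSeq-style prefixes (`XP_`, `NP_` and similar for proteins; `XM_`, `NM_` and similar for transcripts) and puts everything else, such as GenBank-style IDs, in an "unclassified" file to sort by hand. The file names are hypothetical and the toy list reuses accessions that appear in the headers further down.

```bash
# Toy input: one accession per line, protein and nucleotide IDs mixed.
# All file names here are hypothetical.
workdir=$(mktemp -d)
printf '%s\n' XP_024533701.1 XM_024677933.1 PTQ41991.1 XP_024541808.1 \
    > "$workdir/mixed_accessions.txt"

# RefSeq protein prefixes
grep -E '^(NP|XP|YP|WP|AP)_' "$workdir/mixed_accessions.txt" \
    > "$workdir/protein_accessions.txt" || true

# RefSeq transcript prefixes
grep -E '^(NM|XM|NR|XR)_' "$workdir/mixed_accessions.txt" \
    > "$workdir/nucleotide_accessions.txt" || true

# Anything else (e.g. GenBank-style IDs) is left for manual inspection
grep -Ev '^(NP|XP|YP|WP|AP|NM|XM|NR|XR)_' "$workdir/mixed_accessions.txt" \
    > "$workdir/unclassified_accessions.txt" || true

# Report how many accessions ended up in each list
for f in protein nucleotide unclassified
do  echo "$f: $(grep -c '' "$workdir/${f}_accessions.txt")"
done
```

Each resulting list holds a single accession type and could then be submitted to a batch retrieval tool separately.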

# Prepare data

Combining and linearising fasta files.

In [1]:
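# Linearise every FASTA file in data/: one header line, one sequence line per record.
# Files that already end in _linear.fasta are skipped, so rerunning this cell
# does not produce *_linear_linear.fasta copies.
for i in data/*.fasta
do  case "$i" in *_linear.fasta) continue ;; esac
    inseq=$(basename "$i" .fasta)
    awk '/^>/ {printf("%s%s\n",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' \
     "data/$inseq.fasta" \
     > "data/${inseq}_linear.fasta"
done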
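
To illustrate what the linearising step does, here is a small sketch that runs the same awk one-liner on a toy FASTA file in a temporary directory. The file names and the toy sequences are made up for the example.

```bash
# Toy multi-line FASTA (hypothetical file name and sequences)
workdir=$(mktemp -d)
printf '>seq1 toy\nMKKGPW\nTAAEDQ\n>seq2 toy\nMGRAPC\n' > "$workdir/toy.fasta"

# Same awk program as above: print each header on its own line and
# join all sequence lines of a record into a single line.
awk '/^>/ {printf("%s%s\n",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' \
    "$workdir/toy.fasta" > "$workdir/toy_linear.fasta"

cat "$workdir/toy_linear.fasta"
# >seq1 toy
# MKKGPWTAAEDQ
# >seq2 toy
# MGRAPC
```

With exactly two lines per record, tools like `grep -A1` or `paste - -` can handle one sequence at a time.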

In [3]:
ls data/*_linear.fasta

data/ARP_sequences_linear.fasta
data/Azfi-mybs-subfamVI-suspects_linear.fasta
data/Azfi-v1-MYB-sequences_linear.fasta
data/Azfi-v1-MYB-sequences_linear_linear.fasta
data/CDC5-outgroup_sequences_linear.fasta
data/combi_sequences_linear.fasta
data/combi_sequences_linear_linear.fasta
data/combi-VI-VII-Azfisuspects_linear.fasta
data/III_sequences_linear.fasta
data/II_sequences_linear.fasta
data/I_sequences_linear.fasta
data/IV_sequences_linear.fasta
data/MYB33_ARATH_linear.fasta
data/MYB33_ARATH_linear_linear.fasta
data/R1R2R3_sequences_linear.fasta
data/VIII_sequences_linear.fasta
data/VII_sequences_linear.fasta
data/VII_sequences_linear_linear.fasta
data/VI_sequences_linear.fasta
data/VI_sequences_linear_linear.fasta
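
Before concatenating, it can help to check that every expected subfamily file is actually there. Below is a minimal sketch of such a check; it runs on a temporary directory with placeholder files rather than on the real `data` directory, and the list of names is my assumption of what the combined set should contain. Subfamily V is deliberately left out of the toy set, mirroring the listing above.

```bash
# Build a toy data directory (placeholder files, hypothetical content)
datadir=$(mktemp -d)
for name in CDC5-outgroup I II III IV VI VII VIII ARP R1R2R3
do  printf '>placeholder\nMKK\n' > "$datadir/${name}_sequences_linear.fasta"
done

# Check each expected file; -s is true only for files that exist and are not empty
missing=""
for name in CDC5-outgroup I II III IV V VI VII VIII ARP R1R2R3
do  f="$datadir/${name}_sequences_linear.fasta"
    if [ ! -s "$f" ]
    then missing="${missing:+$missing }$name"
    fi
done

echo "missing or empty: ${missing:-none}"
```

On the real directory, `datadir=data` would replace the temporary directory and the placeholder loop would be dropped.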


In [4]:
# Note: data/V_sequences_linear.fasta is not in the listing above, so cat reports it as missing.
# tee writes the combined file and still echoes the sequences to the notebook.
cat data/CDC5-outgroup_sequences_linear.fasta \
    data/I_sequences_linear.fasta    \
    data/II_sequences_linear.fasta   \
    data/III_sequences_linear.fasta  \
    data/IV_sequences_linear.fasta   \
    data/V_sequences_linear.fasta    \
    data/VI_sequences_linear.fasta   \
    data/VII_sequences_linear.fasta  \
    data/VIII_sequences_linear.fasta \
    data/ARP_sequences_linear.fasta  \
    data/Azfi-mybs-subfamVI-suspects_linear.fasta \
    data/R1R2R3_sequences_linear.fasta \
    data/MYB33_ARATH_linear.fasta \
    | tee data/combi-I-to-VII-Azfi_linear.fasta

>CDC5_XM_024677933.1 XP_024533701.1 transcription factor MYB61-like [Selaginella moellendorffii]
MVSRREQPDDHELGGVTSSSSIVRKGLWSPKEDELLLNFILQHGCGNVWTTVPKLAGLQRSGKSCRLRWMNYLRPDLKRGKFSDEENQKLIELHGLVGNRWAYIASQLPGRTDNDVKNQWNSRIRNKALVAHSPVSQDHERSDPPPSPPTPEIPDPAGEEFRDTHLLQSDPAEEPASSPAAKSSAGSSASLQSDGQSPMMDGDLGCVVWQFFMR
>CDC5_XM_024686040.1_XP_024541808.1 cell division cycle 5-like protein, partial [Selaginella moellendorffii]
MRIMIKGGVWKNTEDEILKVAVMKYAKNQWPRISSLLARKSAKQCKARWYEWLDPSIKKTEWTREEDEKLLHLAKLMPTQWRTIAPIVGRTPAQCLQRYEKLLDAACYKDESYEPADDPRKLRPGEIDPNPESKPARPDPVDMDEDEKEMLSEARARLANTRGKKAKRKAREKQLEEARRLAALQKRRELKAACIGSRLRKRKFKGIDYNEEIPFEKRPPPGFYDVANEERSVAQPRFPTNVEELEGRRRSDIEAELRKQDTARNKIAQRQDAPSAIMQISKLNDPEAVRKRTKLMLPAPQISDRELEEIVKMSSSADNLPGEDEGSSATRALVANYNQTPRAGMTPARTPGGKGDAIMMEAENLLRLRETQTPLFGGENPELHPSDFSGVTPKKREVQTPNVIATPLTTPGGVGSTPRIGSTPRETSFAMTPKGTPIRDEFHINEGLELAADNPKAEKLRQAEARRNLQASLKGLPNAKNLYQITVLGVPTAQEEAEEEMEADMADVIANAQAEEDAREAAALRKRSKVLQQGLPRPPPATVELIRNTIPRHAEADDPKALIQKELVALLEHDNAKYP

>IV_Mapoly0032s0153_PTQ41991.1 hypothetical protein MARPO_0032s0153 [Marchantia polymorpha]
MAVSPSTSLDFRCRTQDNERIKGPWSPEEDAALQRLVDKYGARNWSLISKGIPGRSGKSCRLRWCNQLSPQVQHRPFTAAEDAAIIQAHAHHGNKWATIARLLPGRTDNAIKNHWNSTLRRRYLAERSRADEEGSYVCRIRDEDDLTSTLEARKQRCSIDLESSGMAQQLSNEGSSFLDSSSMCGSPLNTSQCAQQVKKLNFAPSSPSCSEHSDPTTVLAPQVFRPVPRPSAFTCFSPSSTTSGGLTKQMQQSPQEASSSSTDPPTSLSLSLPGTCANRIEREVSPRCSPTTRPTLPPPQPIAQAQHSPRSENHHHLRPHEQLKESPPHISAIAWRQPAPPAEHLHEQSGFGIVNNGGAMLVGQAAVPTDMMSAAVRAAVAQALQAPAAAAPPARAAPMMGMGFGLDAAVNAGLLAMMRDMVAKEVQKYMAAVQAPTCLPPFSVMGPDYASLTSHPEFLGTMTPSVPRKAG
cat: data/V_sequences_linear.fasta: No such file or directory
>VI_g8575.t1 length:1504 (mRNA) (CHBRA15g00250)  (myb-related protein 305-like)
MDSGGGAVVDADDSALPAQHQGQSHPQAHGQKVAAQQQSVGGGKGGGAGPAVCKSESDAAAAVGGAHGSSPLSLAGDEASAAAAGGKGVSGGAPAVGPKRPLKRGVWLPEEDEILKGYVANQGPKNWSSIETMGLLARSGKSCRLRWMNHLRPDLNRRSTKFTPEEEAIVVTKQKLHGNKWAQIAKSLSGRTDNDVKNFWNMRMKKLAKLARLEKQRQRQLLISQGSPIAMAAAAMGLPPGARMLASPGDVVVSFNPRTIAEAAAQQRLMVSSSAVPAHHHRLQNFCNASAGLA

>VII_Sacu_v1.1_s0272.g027033 Myb transcription factor [0.077]
MQNRQRYGELIRQTEMEQHQQWCPTYAPPVASAAGSNDGSKKRERANESKGKSSSSRKRQREAAAMGDGRGGNGRLAVQMGGSDGGDGQAERGSADGVNGRATAEMKKGPWTAAEDQKLALYVEKYGEGNWNAVQKVVGVRRCGKSCRLRWTNHLKPGLKKGSLSPQEERIIITQHAALGNRWARIATMLPGRTDNEIKNFWNTRMKRHARANLPLYPAEVVASGAAIAKLEQATWTGAHAKWSEDIARGWSRSDANLAKTTPPAATTTPTTSSLLFPPSSALPQHQRQQSVVRPLQKQSFCLDEYPSRGLIVNHEANTLQKLRHACNDAGYSSGYGGGSTTGGGYAGAVGTGGYGGAASCGAAIGGTVPGAIGGDGGAIGGDKPPSTLACANNTIFLELPSVQSAESADSASSCLSNHHNYGVDHYDVDNHRFAGNQIPHVADNIQFSGNQIPHVVGNRRFAGNQIPHVADNCRFAGNQIHYIVANHRFTGNHYHHPAGADGVGGKANAAAGTDGLAGKMHTGIDRFAEKQHQHTAVEGLTGRQHYCTGNERLAVKQHHHTAGNNLFAGNSIYNGLGYSLLEDVLMLRQQGSGGAVHMQYEAENNELPCTLTEEEEDLFLAESQHFATSLPESETMASIDPLSFLGGRSLTLLADSDLKQVAEKPESSSSCSSNQQHQTLDYCIDNSTQQPAMRRNEYGEASLAHYVVKEEEDGMICREEEEEDEKEEEELYTLFTFGLASEQASEDKKPKFGGISDHKNSIAAEFGGNLEFTWTYIP
>VII_MA_96853p0010[582 residues] 
MALVVYGNVAASASASASASATGATEQATEEVGTLKKGPWTSAEDAILVEYVKKHGEGNWNAVQKHSGLSRCGKSCRLRWANHLRPNLRKGAFTTEEERKILELHAKLGNKWARMAAQLPGRTD

>Azfi_s0014.g013584
MGRAPCCEKESVKRGPWTPEEDAKLLSCVAQHGTGSWRTVPKKAGLQRCGKSCRLRWTNYLRPDLKHGRFSDHEEQTIVRLHAALGSRWSLIAAQLPGRTDNDVKNYWNTRLKKKLCEMGIDPITHKPISQLLADLAGSMAVPVAGSSMPTTSTHMVGGRTIAEAALGCFKDEMLNVIMRRNPNTMVQQHHHHNHHNHPSSSSSPNLSLVSAPPQSVDHMSSLSLSFQQRPSSFTSSGSGSANANGFFAHGFEHDQQLINNSKPLMHMLLESTPVTPTTHHMASSSSQYAFHRPDTMGAQMRQTSSFINNNNNTNSNNNNHHHGNDIHLPMASHNLATSSITFPAQMQLNTPTTTSTSTTSASTNFSTLSLAQSAEKFLQANPTAVQPWPQLVPDQQEEEEEEEEEGEEEEESKKRRLVNNSMEDAHQELNNIASECHWRGDQQLNEDEDEDDEINQLRQFSNTTPPITYNINSSSPTSLQSRMNTS
>Azfi_s0021.g015882
MEQHQRDHRRHFYHMIHTDYNNKQQQDIAVHHKSQQQQEEEEEEEEKLKRKQEQEDAEEDAQGSNTMMKKGPWTAAEDQLLMAYVEKYGEGNWNSVQKLSGVWRCGKSCRLRWTNHLKPGLKKGSLTPEEERVIITQHAALGNRWARIATMLPGRTDNEIKNFWNTRMKRHIRARLPLYPTDVITASNSHPDPTSTSSLSSTKGAESEGTGSSMNKDIDLLNAPKGQQAQPTPNVSSLNDRCISRGASIPYGSLARNDMKIYQSHMNRRSLINYPYGMSPSPLIGGPVHSLKRARNGDSSSTGEDGRPSSQPFSANLFLKSPDVRGSSYPLKLDAFSQGLGFPSSSSYGIPKPPDIHSVTSNNTTGMSFKSSTISLNKNNTEFGVSGPSFPYSVFKSENSSIHLSGLKVELPSVQLAESADSAGTPSSCLSTHSLSNETLNTTFHNSSNSNNHYINDDVDSSSLLLDEILEK
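
As a quick sanity check on a combined file like this, the sequences can be counted per label, taking everything before the first underscore in each header as the label. This is a sketch on a toy file: the headers are copied from the output above, while the short sequences and the file name are placeholders.

```bash
# Toy combined FASTA: real headers from above, placeholder sequences
workdir=$(mktemp -d)
cat > "$workdir/combi_toy_linear.fasta" <<'EOF'
>CDC5_XM_024677933.1
MVSRREQ
>CDC5_XM_024686040.1_XP_024541808.1
MRIMIKG
>IV_Mapoly0032s0153_PTQ41991.1
MAVSPST
>VI_g8575.t1
MDSGGGA
>VII_Sacu_v1.1_s0272.g027033
MQNRQRY
>VII_MA_96853p0010
MALVVYG
>Azfi_s0014.g013584
MGRAPCC
>Azfi_s0021.g015882
MEQHQRD
EOF

# Split headers on "_", strip the leading ">", and count records per label
counts=$(awk -F'_' '/^>/ {sub(/^>/,"",$1); n[$1]++} END {for (k in n) print k, n[k]}' \
    "$workdir/combi_toy_linear.fasta" | sort)

echo "$counts"
```

A label with an unexpected count, or an expected label that is absent, points to a file that was skipped or included twice during concatenation.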