The learning objectives for this notebook are:

* work with compressed data mounted from `ongaeshi`
* subset `peroba` alignment and metadata files (by region and date), excluding Wuhan-Hu-1
* find the n nearest global neighbors to each of those, and append their sequences and metadata to my files
* remove redundant neighbors and duplicated sequences
* create a maximum likelihood tree with `IQTREE2`
* run `TreeTime` to update the branch lengths based on sample dates
* do a mugration model in `TreeTime` based on region
* make a tree diagram in `R` to display the mugration model, based on the notebooks from Leo
    
It should be noted that this notebook won't run just as is (it could with some adjustments) but is just a record of the shell commands I ran on my VM, `madeline-01`.

It has been scrambled since its original iteration, for streamlining.

In [None]:
ssh madeline-01
mkdir 20220217 #directory for today's data
cd ~/ongaeshi-mnt/databases/

#metadata: peroba_meta.220209_173253.tsv.xz
#alignment: peroba_align.220209_173253.aln.xz

Look at the metadata structure:

In [None]:
#look at the metadata structure:
xzcat peroba_meta.220209_173253.tsv.xz | head -10

#columns are:
#strain	gisaid_id	date	location	location_gisaid	location_coguk	age	gender	gisaid_clade	pango_lineage	timestamp	freq_ACGT	freq_N	seq_hash
#note that dates are of the form: 2021-12-09

Create a metadata file and a `.txt` file of sequence names to use, for all the sequences from England, from July 1, 2021 to February 28, 2022.

In [None]:
#subset English sequences from July 1, 2021 to February 28, 2022 (which hasn't happened yet, but it'll catch everything)

#save the header (column names) to a tsv
xzcat peroba_meta.220209_173253.tsv.xz | head -1 > /home/ubuntu/20220217/peroba_meta.220209_173253.England.tsv

#subset by 'England/' using xzgrep, and append this to the .tsv with header names you just made
xzgrep 'England/' peroba_meta.220209_173253.tsv.xz >> /home/ubuntu/20220217/peroba_meta.220209_173253.England.tsv

#subset by date: all rows between two dates
python3
>>> import pandas as pd
>>> df = pd.read_csv('/home/ubuntu/20220217/peroba_meta.220209_173253.England.tsv', parse_dates=['date'], sep='\t', header=0)
>>> df['date'].min() #2020-09-15 is the earliest sample
>>> date_subset = df.loc[df['date'].between('2021-07-01','2022-02-28', inclusive='both')]
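#(row count not recorded here; 1941 is the NORW count from further down, and the matching England alignment below has 256,146 sequences)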

#save relevant metadata as .tsv
>>> date_subset.to_csv('/home/ubuntu/20220217/peroba_meta.220209_173253.England.20210701_20220228.tsv', sep='\t', header=True, index=False)
#save record names (format: "England/NORW-XXXXXXX/YYYY") to include as a .txt
>>> date_subset['strain'].to_csv('/home/ubuntu/20220217/peroba_meta.220209_173253.England.20210701_20220228.seqnames.txt', sep='\t', header=False, index=False)

>>> Ctrl+D  
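
A possible alternative to the `xzgrep` + `pandas` steps above is to stream the compressed metadata once and filter by name prefix and date in the same pass. This is only a sketch under assumptions: the function name `filter_rows` and the demo sample names are made up for illustration, and it matches on the start of the `strain` column rather than anywhere on the line as `xzgrep` does.

```python
import csv
import io
import lzma  # only needed for the real file: lzma.open(path, "rt")
from datetime import date


def filter_rows(lines, name_prefix, start, end):
    """Yield the header, then rows whose strain starts with name_prefix
    and whose ISO-format date lies between start and end (inclusive)."""
    reader = csv.DictReader(lines, delimiter="\t")
    yield reader.fieldnames
    for row in reader:
        if not row["strain"].startswith(name_prefix):
            continue
        try:
            sampled = date.fromisoformat(row["date"])
        except (ValueError, TypeError):
            continue  # skip missing or incomplete dates
        if start <= sampled <= end:
            yield [row[col] for col in reader.fieldnames]


# Tiny in-memory stand-in for the metadata (hypothetical sample names).
# On the VM, replace `demo` with lzma.open("peroba_meta.220209_173253.tsv.xz", "rt").
demo = io.StringIO(
    "strain\tdate\n"
    "England/NORW-AAA/2021\t2021-12-09\n"
    "England/NORW-BBB/2021\t2021-06-30\n"
    "Japan/CCC/2021\t2021-12-09\n"
)
kept = list(filter_rows(demo, "England/", date(2021, 7, 1), date(2022, 2, 28)))
```

Here `kept` holds the header plus the one row that passes both filters. The rows can be written back out with `csv.writer(..., delimiter="\t")`.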

Now subset the big fasta file to match, and get rid of any duplicate records.

In [None]:
#subset the fasta file using the .txt file of record names: this step is a bit slow
nohup goalign subset -f /home/ubuntu/20220217/peroba_meta.220209_173253.England.20210701_20220228.seqnames.txt -i peroba_align.220209_173253.aln.xz -o /home/ubuntu/20220217/peroba_align.220209_173253.England.20210701_20220228.aln
#20220221: this takes ~25 minutes

#I think another way to do this would be to do: xzgrep -w -A 1 -f  /home/ubuntu/20220217/peroba_meta.220209_173253.England.20210701_20220228.seqnames.txt peroba_align.220209_173253.aln.xz --no-group-separator
#note that goalign subset converts the fasta records from two-line to multi-line; this might be what messes with uvaia later on

In [None]:
seqkit stats /home/ubuntu/20220217/peroba_align.220209_173253.England.20210701_20220228.aln

file                                                                            format  type  num_seqs        sum_len  min_len  avg_len  max_len
/home/ubuntu/20220217/peroba_align.220209_173253.England.20210701_20220228.aln  FASTA   DNA    256,146  7,659,533,838   29,903   29,903   29,903

In [None]:
#remove duplicate records
nohup goalign reformat fasta -i /home/ubuntu/20220217/peroba_align.220209_173253.England.20210701_20220228.aln  --ignore-identical 0  -o /home/ubuntu/20220217/peroba_align.220209_173253.England.20210701_20220228.unique.aln

In [None]:
seqkit stats /home/ubuntu/20220217/peroba_align.220209_173253.England.20210701_20220228.unique.aln

file                                                                                   format  type  num_seqs        sum_len  min_len  avg_len  max_len
/home/ubuntu/20220217/peroba_align.220209_173253.England.20210701_20220228.unique.aln  FASTA   DNA    256,146  7,659,533,838   29,903   29,903   29,903

OK, so the record count is unchanged. Note, though, that according to the help text quoted below, `--ignore-identical 0` "Does not ignore anything", so this run of `goalign reformat` did not actually test for duplicates; level 1 or 2 would be needed for that. Everything is the same length as Wuhan-Hu-1 (29,903), which is good.

Next time, it might be best to remove duplicates while subsetting, as `goalign subset --ignore-identical 2`:
    
`--ignore-identical int   Ignore duplicated sequences that have the same name and potentially have same sequences, 0 : Does not ignore anything, 1: Ignore sequences having the same name (keep the first one whatever their sequence), 2: Ignore sequences having the same name and the same sequence`

This will both save a step and avoid the problem of identical sequences being renamed.
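
To make the three levels concrete, here is a small sketch of how I read that help text. It is an illustration only, not `goalign`'s actual implementation; the function name `ignore_identical` and the toy records are made up.

```python
def ignore_identical(records, level):
    """records: iterable of (name, sequence) pairs, in file order.

    level 0: keep everything.
    level 1: drop any later record that reuses a name (first one wins).
    level 2: drop a later record only if both name and sequence repeat.
    """
    seen_names = set()
    seen_pairs = set()
    kept = []
    for name, seq in records:
        if level == 1 and name in seen_names:
            continue
        if level == 2 and (name, seq) in seen_pairs:
            continue
        seen_names.add(name)
        seen_pairs.add((name, seq))
        kept.append((name, seq))
    return kept


# Toy records: name "a" appears three times, twice with the same sequence.
records = [("a", "ACGT"), ("a", "ACGT"), ("a", "ACGA"), ("b", "ACGT")]
kept_0 = ignore_identical(records, 0)
kept_1 = ignore_identical(records, 1)
kept_2 = ignore_identical(records, 2)
```

On the toy records, level 0 keeps all four, level 1 keeps one `a` and one `b`, and level 2 keeps both distinct `a` sequences plus `b`.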

Alternatively, use `goalign dedup`, which removes duplicate sequences (considering the name, or not, depending what you tell it to do): ask Leo if `dedup` should not be used for some reason (could identical sequences with different names be worth keeping?)

Also, note that next time **this would be a good place for masking problematic sites, as it hasn't been done already in Peroba**.

--------------------------------------------------

Make a new metadata file containing only samples from NORW:

In [None]:
#save the header (column names) to a tsv
xzcat peroba_meta.220209_173253.tsv.xz | head -1 > /home/ubuntu/20220217/peroba_meta.220209_173253.NORW.tsv

#subset by 'England/NORW-' using xzgrep, and append this to the .tsv with header names you just made
xzgrep 'England/NORW-' peroba_meta.220209_173253.tsv.xz >> /home/ubuntu/20220217/peroba_meta.220209_173253.NORW.tsv

#count how many records this gives: 3724
xzgrep 'England/NORW-' -c peroba_meta.220209_173253.tsv.xz

From this, keep only the samples between two dates (inclusive), and save the names of those records as a separate `.txt` file

In [None]:
#subset by date: all rows between two dates
python3
>>> import pandas as pd
>>> df = pd.read_csv('/home/ubuntu/20220217/peroba_meta.220209_173253.NORW.tsv', parse_dates=['date'], sep='\t', header=0)
>>> df['date'].min() #2021-01-31 is the earliest sample
>>> date_subset = df.loc[df['date'].between('2021-09-01','2021-12-30', inclusive='both')]
#this gives 1941 rows

#save relevant metadata as .tsv
>>> date_subset.to_csv('/home/ubuntu/20220217/peroba_meta.220209_173253.NORW.20210901_20211230.tsv', sep='\t', header=True, index=False)
#save record names (format: "England/NORW-XXXXXXX/YYYY") to include as a .txt
>>> date_subset['strain'].to_csv('/home/ubuntu/20220217/peroba_meta.220209_173253.NORW.20210901_20211230.seqnames.txt', sep='\t', header=False, index=False)

>>> Ctrl+D   

Get the relevant fasta records using the `.txt` file.  This is taken from a unique set (since NORW is in England and the English samples from this time were all unique, above) so no need to worry about duplicate records.

In [None]:
#subset the England fasta file using the .txt file of record names: this step is a bit slow
nohup goalign subset -f /home/ubuntu/20220217/peroba_meta.220209_173253.NORW.20210901_20211230.seqnames.txt -i /home/ubuntu/20220217/peroba_align.220209_173253.England.20210701_20220228.unique.aln -o /home/ubuntu/20220217/peroba_align.220209_173253.NORW.20210901_20211230.aln

#to see if it's still running:
ps xw

#goalign subset renames identical sequence names, eg: 
#2022/02/18 11:44:45 Warning: sequence "Japan/XXXX-XXXX-XXX/YYYY" already exists in alignment, renamed in "Japan/XXXX-XXXX-XXX/YYYY_0001"

Find the 4 nearest neighbours in the England `.aln` file to each sequence.  I forgot to remove my `NORW` samples from the `England` set, so...we'll see what turns up.  The output of this program is a gzip-compressed sequence file (even though I named it `.aln` below) containing both my `NORW` sequences and their nearest neighbours.

In [None]:
#find the 4 nearest neighbours in the England .aln file to each sequence
#template: uvaia -n 4 -r gisaid_database.aln.gz -o 4nn.aln.gz -t 12 query.aln > 4nn.txt
nohup uvaia -n 4 -r /home/ubuntu/20220217/peroba_align.220209_173253.England.20210701_20220228.unique.aln -o /home/ubuntu/20220217/peroba_align.220209_173253.NORW.20210901_20211230.4nn.aln -t 12 /home/ubuntu/20220217/peroba_align.220209_173253.NORW.20210901_20211230.aln > /home/ubuntu/20220217/peroba_align.220209_173253.NORW.20210901_20211230.4nn.txt

The above line took >6h.  The file must be decompressed to work with `rapidnj`.  Next time save the output of `uvaia` to `.gz` so you don't need to use `cat`, below:

In [None]:
cat /home/ubuntu/20220217/peroba_align.220209_173253.NORW.20210901_20211230.4nn.aln | gunzip > /home/ubuntu/20220217/NORW.4nn.aln

Check seqkit again: they're all the same length.

In [None]:
seqkit stats /home/ubuntu/20220217/peroba_align.220209_173253.NORW.20210901_20211230.aln
    
file                                                                         format  type  num_seqs     sum_len  min_len  avg_len  max_len
/home/ubuntu/20220217/peroba_align.220209_173253.NORW.20210901_20211230.aln  FASTA   DNA      1,941  58,041,723   29,903   29,903   29,903

Remove redundant nearest neighbours.  The sequence file containing the nearest neighbours should be: `/home/ubuntu/20220217/NORW.4nn.aln`

In [None]:
#create very fast distance-based neighbour-joining tree as Newick file
nohup rapidnj /home/ubuntu/20220217/NORW.4nn.aln -i fa -t d -n -c 8 -o t -x /home/ubuntu/20220217/NORW.4nn.aln.tre

#select the 100 leaves from that tree that maximize phylogenetic diversity: these are your most interesting neighbours
nohup iqtree -k 100 /home/ubuntu/20220217/NORW.4nn.aln.tre

`/home/ubuntu/20220217/NORW.4nn.aln.tre.pda` now includes the list of 100 non-redundant neighbour sequences and the pruned 4nn tree.

Take the list of nns to use and save it as `seqs_to_keep.txt`, being careful to reformat the names to match the alignment file.  

In [None]:
grep -A 100 "The optimal PD set has 100 taxa:" /home/ubuntu/20220217/NORW.4nn.aln.tre.pda | sed '1d'  | sed 's/^.//;s/.$//;s/_/\//;s/_/\//' > /home/ubuntu/20220217/seqs_to_keep.txt
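
For the record, this is what the `sed` chain above does to each line, written out in Python so the renaming can be checked on one name at a time. It is a sketch: the function name is made up, the sample names are hypothetical, and the `x` characters stand in for whatever the first and last characters of a `.pda` line really are.

```python
def pda_name_to_fasta_name(line):
    # Equivalent of the first two sed expressions (s/^.// and s/.$//):
    # drop the first and the last character of the line.
    trimmed = line[1:-1]
    # Equivalent of running s/_/\// twice without the g flag:
    # only the first two underscores become slashes.
    return trimmed.replace("_", "/", 2)


plain = pda_name_to_fasta_name("xEngland_NORW-ABC123_2021x")
renamed = pda_name_to_fasta_name("xEngland_NORW-ABC123_2021_0001x")
```

Only the first two underscores are converted, so a suffix like the `_0001` that `goalign subset` adds to repeated names survives unchanged.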

Now use this list to subset the 4nn.aln file containing all potential neighbour sequences.  When I tried to subset from the NORW file, the output file was empty, so `uvaia` must have taken care that the neighbours are not the sequences themselves.  However, note that many of the reduced refs are `NORW` sequences themselves.

In [None]:
nohup goalign subset -f /home/ubuntu/20220217/seqs_to_keep.txt -i /home/ubuntu/20220217/peroba_align.220209_173253.England.20210701_20220228.unique.aln -o /home/ubuntu/20220217/reduced_refs.aln

Next, prune the tree that you made from all the potential neighbours with `rapidnj` down to just the seqs to keep:

In [None]:
nohup gotree prune -r --tipfile /home/ubuntu/20220217/seqs_to_keep.txt -i /home/ubuntu/20220217/NORW.4nn.aln.tre -o /home/ubuntu/20220217/reduced_refs2.tre

This didn't work (`The tree after tip removal is only made of two tips`), so I'm going to make a list called `leftovers.txt` of sequences to remove instead and use that:

In [None]:
grep ">" /home/ubuntu/20220217/NORW.4nn.aln | sed 's/>//' > /home/ubuntu/20220217/allseqs.txt
grep -wv -f /home/ubuntu/20220217/seqs_to_keep.txt /home/ubuntu/20220217/allseqs.txt > /home/ubuntu/20220217/leftovers.txt

In [None]:
nohup gotree prune --tipfile /home/ubuntu/20220217/leftovers.txt -i /home/ubuntu/20220217/NORW.4nn.aln.tre -o /home/ubuntu/20220217/reduced_refs.tre

This tree has 3991 tips!  This was not supposed to happen...

Finally, concatenate the neighbour and query sequence files together, and build a new maximum likelihood tree from all those sequences, using the non-redundant neighbours as the backbone for the tree (this is what `-g` does in `iqtree`).

In [None]:
cat /home/ubuntu/20220217/reduced_refs.aln /home/ubuntu/20220217/peroba_align.220209_173253.NORW.20210901_20211230.aln > /home/ubuntu/20220217/allseqs.aln
iqtree -s /home/ubuntu/20220217/allseqs.aln -m HKY+G -g /home/ubuntu/20220217/reduced_refs.tre -t PARS

`iqtree` throws an error saying that there are duplicated NORW sequences in the alignment, and to rename some of them.  So, maybe some of the neighbours were the tips themselves after all.  The internet is cutting out so much here so I'm going home.  Will get back to it tonight!

A quick fix for now is `goalign dedup /home/ubuntu/20220217/allseqs.aln -o /home/ubuntu/20220217/allseqs_dedup.aln`
That failed, so `seqkit rmdup -s < /home/ubuntu/20220217/allseqs.aln > /home/ubuntu/20220217/allseqs_dedup.aln`: this worked.

Try again: `iqtree -s /home/ubuntu/20220217/allseqs_dedup.aln -m HKY+G -g /home/ubuntu/20220217/reduced_refs.tre -t PARS`

This caused many errors of the form:

`ERROR: ERROR: Taxon <seqname> in constraint tree does not appear in full tree`, for 3991 taxa.

So, something's wrong with `reduced_refs.tre`.  I suspect the tree doesn't have the same samples in it as `reduced_refs.aln`, which is supported by `gotree stats` crashing.  What would happen if, instead of pruning an existing tree down to the reduced refs, I took the reduced refs and built a tree from scratch with them?  Why was pruning suggested over rebuilding?  Just for speed, or is there a bigger reason?

#I'm going to try this again, but first I'm going to cut all the sequences that overlap with the NORW set out of the potential nearest neighbours set.

remake the reduced_refs tree using iqtree to begin with

    iqtree -s reduced_refs.aln

The Newick tree is in `reduced_refs.aln.treefile`.  gotree stats struggles even with this, so...maybe:

    grep -o -i "England" reduced_refs.aln.treefile | wc -l
    
This counts the sample names containing "England" in the Newick file.  There are 100!  So that should work.
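
Since `gotree stats` keeps struggling, another way to count tips without relying on the word "England" is a small regular expression over the Newick string. This is a rough sketch that assumes unquoted tip names and no bracketed comments in the file; the function name and the toy tree are made up.

```python
import re

# A tip label is whatever follows "(" or "," up to the next Newick delimiter.
# Internal node labels and support values follow ")" and so are not matched.
TIP = re.compile(r"[(,]\s*([^\s(),:;]+)")


def newick_tips(newick):
    """Return the tip names of an unquoted Newick string, in order."""
    return TIP.findall(newick)


# Toy tree with hypothetical names and one internal support value (0.95).
tree = "((England_A_2021:0.1,England_B_2021:0.2)0.95:0.3,Japan_C_2021:0.4);"
tips = newick_tips(tree)
n_england = sum(name.startswith("England") for name in tips)
```

For a real file, read it first with `open(path).read()` and pass the string to `newick_tips`; `len(tips)` is then the total tip count.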

Gingerly try adding the NORW sequences in:

    iqtree -s /home/ubuntu/20220217/allseqs.aln -m HKY+G -g /home/ubuntu/20220217/reduced_refs.aln.treefile -t PARS

This finds 48 duplicated sequences.  Check with `grep` to see if you find the same:

    grep ">" reduced_refs.aln > reduced_refs_names.txt
    grep ">" /home/ubuntu/20220217/peroba_align.220209_173253.NORW.20210901_20211230.aln > NORW_names.txt
    grep -f reduced_refs_names.txt NORW_names.txt | grep -c "England"

yes, 48 sequences are duplicated between the reduced refs and the NORW sequences.

so, go through the neighbours from uvaia and take out the sequences that are also found in NORW.
Potential nearest neighbours multiline fasta: `/home/ubuntu/20220217/NORW.4nn.aln`
NORW multiline fasta: `/home/ubuntu/20220217/peroba_align.220209_173253.NORW.20210901_20211230.aln`

    grep ">" /home/ubuntu/20220217/NORW.4nn.aln > /home/ubuntu/20220217/4nn_names.txt
    grep -vf NORW_names.txt 4nn_names.txt > unique_neighbours.txt 
    
nope, that took out 1729, I just want 48 gone

For the record:
    
    grep -c "England" /home/ubuntu/20220217/reduced_refs_names.txt 
    100
    grep -c "England" /home/ubuntu/20220217/NORW_names.txt 
    1941
    grep -c "England" 4nn_names.txt 
    3991

    grep -vw -f NORW_names.txt 4nn_names.txt | sed 's/>//' > unique_neighbours_2.txt 
This still gives me 2262.  So 1729 NORW names made it into the nearest neighbours pool: remembering that they haven't been reduced yet, this is actually fine.
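
Because `grep -f` treats every line of the names file as a pattern, a stricter way to double-check these counts is exact set membership on the names. A minimal sketch, with made-up function names and hypothetical sample names:

```python
def read_names(lines):
    """Strip whitespace and any leading '>' from a list of name lines."""
    return [line.strip().lstrip(">") for line in lines if line.strip()]


def split_neighbours(neighbour_names, query_names):
    """Split neighbours into those that are also queries and those that are not."""
    queries = set(query_names)
    overlap = [name for name in neighbour_names if name in queries]
    unique = [name for name in neighbour_names if name not in queries]
    return overlap, unique


# Hypothetical stand-ins for 4nn_names.txt and NORW_names.txt.
nn_names = read_names([">England/NORW-AAA/2021\n", ">England/QEUH-BBB/2021\n",
                       ">England/NORW-CCC/2021\n"])
norw_names = read_names([">England/NORW-AAA/2021\n", ">England/NORW-CCC/2021\n"])
overlap, unique = split_neighbours(nn_names, norw_names)
```

With the real files, `len(overlap)` and `len(unique)` should agree with the `grep` counts if the pattern matching behaved as intended. The `unique` list has no `>` in it, which is what `goalign subset` needs.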

Subset the fasta file of nearest neighbours:

    nohup goalign subset -f /home/ubuntu/20220217/unique_neighbours_2.txt -i /home/ubuntu/20220217/NORW.4nn.aln -o /home/ubuntu/20220217/NORW.4nn.no_overlap.aln
    
Note that it's important for goalign subset that there are no ">" in the sequence names .txt, otherwise it won't run.

Now reduce that dataset:

#create very fast distance-based neighbour-joining tree as Newick file

    nohup rapidnj /home/ubuntu/20220217/NORW.4nn.no_overlap.aln -i fa -t d -n -c 8 -o t -x /home/ubuntu/20220217/NORW.4nn.no_overlap.aln.tre
    
#count how many leaves there are now:

    grep -o -i "England" /home/ubuntu/20220217/NORW.4nn.no_overlap.aln.tre | wc -l
    
2262!  great.

#select the 100 leaves from that tree that maximize phylogenetic diversity: these are your most interesting neighbours

    nohup iqtree -k 100 /home/ubuntu/20220217/NORW.4nn.no_overlap.aln.tre
    
#get their names:
    
    grep -A 100 "The optimal PD set has 100 taxa:" /home/ubuntu/20220217/NORW.4nn.no_overlap.aln.tre.pda | sed '1d'  | sed 's/^.//;s/.$//;s/_/\//;s/_/\//' > /home/ubuntu/20220217/seqs_to_keep_no_overlap.txt
    
#subset the neighbours fasta file again:

    nohup goalign subset -f /home/ubuntu/20220217/seqs_to_keep_no_overlap.txt -i /home/ubuntu/20220217/NORW.4nn.no_overlap.aln -o /home/ubuntu/20220217/reduced_refs_no_overlap.aln
    
    
#make a tree out of those:
    
    iqtree -s /home/ubuntu/20220217/reduced_refs_no_overlap.aln
     
    
#count again to be sure:

    grep -o -i "England" reduced_refs_no_overlap.aln.treefile | wc -l
    
#it's 100 as hoped
    
#concatenate NORW and reduced ref sequences together:
    cat /home/ubuntu/20220217/reduced_refs_no_overlap.aln /home/ubuntu/20220217/peroba_align.220209_173253.NORW.20210901_20211230.aln > /home/ubuntu/20220217/allseqs_no_overlap.aln

#and now try adding the NORW sequences to that backbone.  Note that allseqs is important because there is no sequence data in the .tre itself at the moment.

    iqtree -s /home/ubuntu/20220217/allseqs_no_overlap.aln -m HKY+G -g /home/ubuntu/20220217/reduced_refs_no_overlap.aln.treefile -t PARS
    
Also note that this is...kind of a ridiculous tree.  You've got 1,941 NORW query sequences and only 100 reference sequences, which seems unfair.  Oh dear, it's taking forever to run...

Finally, it's done.  The tree to open in FigTree is `/home/ubuntu/20220217/allseqs_no_overlap.aln.treefile`.

A fun thing to do with this, now that it's done, is to try treetime mugration again, with the more interesting regions now.  Hey, this is what you set out to do in the bullet points at the top anyway.

Quick quick, treetime:

#get list of leaf tips
grep "England" /home/ubuntu/20220217/allseqs_no_overlap.aln | sed 's/>//g'  > /home/ubuntu/20220217/leaftips.txt

#The file `/home/ubuntu/20220217/peroba_meta.220209_173253.England.20210701_20220228.tsv` has been squashed a bit and needs some reformatting.

     1) location_cogukage --> location_coguk    age
     2) Europe / United Kingdom / England --> Europe/UnitedKingdom/England
     3) make sure header is tab-separated
     
     sed -i 's/ \/ /\//g;s/location_cogukage/location_coguk    age/;s/United\s\+Kingdom/UnitedKingdom/;s/\s\+/\t/g' /home/ubuntu/20220217/peroba_meta.220209_173253.England.20210701_20220228.tsv
     
python3
import pandas as pd #needed in a fresh python session

readfile = '/home/ubuntu/20220217/peroba_meta.220209_173253.England.20210701_20220228.tsv'
df = pd.read_csv(readfile, header=0, sep='\t', dtype='str')
df = df.rename(columns={'strain':'name'}) #header required by treetime

#include only the names that match the names in the alignment
leaf_df = pd.read_csv('/home/ubuntu/20220217/leaftips.txt', header=None)
leaf_df = leaf_df.rename(columns={0:'name'})

#add dates to leaf_df, and format the names like in the tree
save_df = leaf_df.merge(df, how='left')
save_df['name'] = save_df['name'].str.replace('/', '_')
print(save_df)

save_df.to_csv('/home/ubuntu/20220217/treetime_dates.tsv', sep='\t', index=False)

Once you've done that, make this into .nf format so it will be easier to run.  Or a bash script.  Either one.  Nextflow is nice because it should deal with intermediate files for you and because it can run things in parallel/specify threads.


Try this again:
 
     sed 's/ \/ /\//g;s/location_cogukage/location_coguk    age/;1s/\s\+/\t/g' /home/ubuntu/20220217/peroba_meta.220209_173253.England.20210701_20220228.tsv > fritest2.tsv
     
This works!

Now remake treetime dates quickly....done.

now the leaf tips and sequence names don't add up, so quickly change '/' to '_' in the alignment file:
sed -i 's/\//_/g' /home/ubuntu/20220217/allseqs_no_overlap.aln

Treetime on default:

! treetime --tree /home/ubuntu/20220217/allseqs_no_overlap.aln.treefile --dates /home/ubuntu/20220217/treetime_dates.tsv --aln /home/ubuntu/20220217/allseqs_no_overlap.aln --outdir /home/ubuntu/20220217/timetree

Treetime with mugration:

add this bit to the code that gets treetime dates:
df['country'] = df['strain'].str.split('/').str[0]
df['region'] = df['strain'].str.split('/').str[1]
df['region'] = df['region'].str.split('-').str[0]
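
The three lines above can be checked on a single name with plain Python. This is just a sketch of the same split logic outside pandas; the function name is made up and the first sample name is hypothetical.

```python
def country_and_region(strain):
    """Split 'Country/LAB-ID/YEAR' into (country, lab prefix)."""
    parts = strain.split("/")
    country = parts[0]
    # Region is the part of the second field before the first hyphen;
    # None if the name has no second field at all.
    region = parts[1].split("-")[0] if len(parts) > 1 else None
    return country, region


norw = country_and_region("England/NORW-ABC123/2021")
other = country_and_region("Japan/XXXX-XXXX-XXX/YYYY")
no_slash = country_and_region("Wuhan-Hu-1")
```

Note that "region" here really means the sequencing lab prefix (such as `NORW`), and a name without a slash gets no region at all.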

run this: 
python /home/ubuntu/scripts/get_treetime_dates.py --metadata_tsv fritest2.tsv --leaftips /home/ubuntu/20220217/leaftips.txt --out /home/ubuntu/20220217/treetime_dates_2.tsv

now run a mugration model:
! treetime mugration --tree /home/ubuntu/20220217/allseqs_no_overlap.aln.treefile --states /home/ubuntu/20220217/treetime_dates_2.tsv --attribute 'region' --outdir /home/ubuntu/20220217/timetree-mugration

The output tree is not annotated, I suspect because so many branch lengths are 0.  How could it infer the region if everything is 0?  To fix this, I will try again with more diverse sequences.  Maybe the 100 most diverse NORW, and the 100 most diverse neighbours.  For now, though, go eat something (done!) and then write up the streamlining.  Then retry.  Then, once you have a tree, visualize it.

--rapidnj of NORW, iqtree -k 100
--rapidnj of nn possibilities, iqtree -k 100
--concatenate, reanalyse


**Appendix of trying out gotree prune yet again and still failing**

Quickly just check if I had gotree prune backwards:

    nohup gotree prune -r --tipfile /home/ubuntu/20220217/NORW.4nn.no_overlap.aln.tre -i /home/ubuntu/20220217/seqs_to_keep_no_overlap.txt -o /home/ubuntu/20220217/reduced_refs_no_overlap.tre
    
Nope, that does nothing.  Try it forwards again?

    nohup gotree prune -r --tipfile /home/ubuntu/20220217/seqs_to_keep_no_overlap.txt  -i /home/ubuntu/20220217/NORW.4nn.no_overlap.aln.tre -o /home/ubuntu/20220217/reduced_refs_no_overlap.tre
    
This says that "The tree after tip removal is only made of two tips" again.  Hmm...this isn't true because I'm asking it to keep 100 tips.  The names are formatted the same in the .tre and the .txt.
    
If I drop the `-r`, I get a real tree:
    
    gotree prune --tipfile /home/ubuntu/20220217/seqs_to_keep_no_overlap.txt  -i /home/ubuntu/20220217/NORW.4nn.no_overlap.aln.tre -o /home/ubuntu/20220217/reduced_refs_no_overlap.tre

Last thing to try is dropping the `-r` and changing `seqs_to_keep` to `seqs_to_prune` for the input:
    
    #get seqs_to_prune
    
    grep -vw -f /home/ubuntu/20220217/seqs_to_keep_no_overlap.txt /home/ubuntu/20220217/unique_neighbours_2.txt | sed 's/>//' > /home/ubuntu/20220217/seqs_to_prune.txt   #2162 records
    
    gotree prune --tipfile /home/ubuntu/20220217/seqs_to_prune.txt  -i /home/ubuntu/20220217/NORW.4nn.no_overlap.aln.tre -o /home/ubuntu/20220217/reduced_refs_no_overlap.tre #2262 records at the end, same number as if it hadn't been pruned at all....so weird
    
Okay, so I'm leaving gotree prune and going with iqtree.

**Congratulations!  Preprocessing and all that is done.**

P.S.  To linearize a fasta file:
    
`awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < /home/ubuntu/20220217/peroba_align.220209_173253.England.20210701_20220228.unique.aln | sed '1d' > /home/ubuntu/20220217/peroba_align.220209_173253.England.20210701_20220228.unique.1.aln`


`awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < /home/ubuntu/20220217/peroba_align.220209_173253.NORW.20210901_20211230.aln | sed '1d' > /home/ubuntu/20220217/peroba_align.220209_173253.NORW.20210901_20211230.1.aln`

https://www.biostars.org/p/9262/
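
The same linearization can be sketched in Python, which is handy if the result needs checking before writing it out. This is an illustrative equivalent of the `awk` one-liner, not a tested replacement for it on the full files; the function name and the toy records are made up.

```python
def linearize_fasta(lines):
    """Yield (header, sequence) pairs with each sequence joined onto one line."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line.strip())
    # Emit the final record, which has no following header to trigger it.
    if header is not None:
        yield header, "".join(chunks)


# Toy multi-line FASTA with hypothetical names.
multiline = [">England/NORW-AAA/2021\n", "ACGT\n", "ACGA\n",
             ">England/NORW-BBB/2021\n", "TTTT\n"]
records = list(linearize_fasta(multiline))
two_line_text = "".join(f"{header}\n{seq}\n" for header, seq in records)
```

Passing an open file handle instead of the list works the same way, since the function only iterates over lines.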