# Compare 6m and 2y auspice haplotypes; design library
Using the 6m and 2y Nextstrain trees, we identified all `derived haplotypes` and their representative strains.
Now we will compare those haplotypes. 
Where a haplotype appears in both trees, we will keep the 6m representative strain and discard the 2y strain.

Then, we will compare to Nextclade metadata.
First I'll purge all haplotypes from the Nextclade set that haven't been sampled in 2023. 
Then I'll merge haplotype record count with the Nextstrain dataset. 
We will purge any 2y haplotypes that have no record from 2023.

This particular year there were only **37 Nextstrain derived haplotypes** that matched the given sampling requirements. A `derived haplotype` is a Nextstrain-specific term for an HA1 haplotype (a subclade with additional amino acid mutations called on that background; for example, a J.2 virus with the S145N mutation would be called 'J.2:S145N') that has reached some threshold of child/descendant sequences over a specified time period.
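
To make the naming convention concrete, here is a minimal sketch of how such a label could be split into its subclade and its additional HA1 mutations. `parse_haplotype` is a hypothetical helper, not part of Nextstrain or of this notebook, and it assumes labels follow the `subclade:mut1,mut2` pattern seen in the examples here.

```python
def parse_haplotype(label):
    """Split a label such as 'J.2:S145N' into ('J.2', ['S145N']).

    Hypothetical helper: assumes the subclade and the comma-separated
    mutations are joined by a single colon, and that a label without a
    colon is a bare subclade with no additional mutations.
    """
    subclade, _, muts = label.partition(':')
    mutations = [m for m in muts.split(',') if m]
    return subclade, mutations


print(parse_haplotype('J.2:S145N'))
print(parse_haplotype('G.1.3.1.1:122D,144N,276E'))
print(parse_haplotype('G.2'))
```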

To more fully take advantage of the space in our library, we decided to use the Nextclade dataset to add additional haplotypes that had not achieved this threshold, and therefore were not called as `derived haplotypes` in the Nextstrain dataset.
We opted to simply make a list of the highest-frequency haplotypes in the Nextclade dataset, then remove those haplotypes already sampled by our Nextstrain analysis (i.e., remove all `derived haplotypes`).
We chose a representative strain for each of those haplotypes by selecting the strain with the most recent collection date.
This **Nextclade method identified 25 additional haplotypes**.

Overall, we selected **62 total representative strains**.

Then, we will finalize the list of haplotypes and representative strains. 
I will output this as a large dataframe and a list of GISAID accession numbers. 

In [1]:
import os
import pandas as pd

In [2]:
resultsdir = '../results'
datadir = '../data'
os.makedirs(resultsdir, exist_ok=True)
os.makedirs(datadir, exist_ok=True)

In [3]:
# [Phylogenetic method]
# My inputs are 2 TSV outputs because I ran on the 2y (two year) and 6m (six month) trees for H3N2
files = {'two_year': 'representative_strains_per_haplotype_2y.tsv', 
        'six_month': 'representative_strains_per_haplotype_6m.tsv'}

# Code for identifying clade overlap 
strains_df = pd.DataFrame()
for key in files: 
    # Import emerging clades output
    df = pd.read_csv(os.path.join(datadir, 'auspice_haplotypes', files[key]), sep='\t')
    df['method'] = key
    # Select minimum branch length strains
    strains_df = pd.concat([strains_df, df[['name', 'div', 'num_date',
    'clade_membership', 'subclade', 'haplotype', 'method']]])

print('Here is some basic info on the Auspice/Phylogenetic method-selected strains...')
# All haplotypes
haplotypes = strains_df['haplotype'].tolist()
print(f'the number output (6m + 2y) haplotypes is... {len(haplotypes)}')
# Unique haplotypes
unique_haplotypes = [i for i in haplotypes if haplotypes.count(i)==1]
print(f'the number of unique haplotypes occurring is... {len(unique_haplotypes)}')
# Haplotypes that appear both in 2y and 6m trees
overlapping_haplotypes = list(set(haplotypes).difference(unique_haplotypes))
print(f'the number of overlapping haplotypes is... {len(overlapping_haplotypes)}')

Here is some basic info on the Auspice/Phylogenetic method-selected strains...
the number output (6m + 2y) haplotypes is... 99
the number of unique haplotypes occurring is... 67
the number of overlapping haplotypes is... 16
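
As an alternative to calling `list.count` inside a comprehension, the same unique/overlapping split could be computed with `value_counts`. This is only a sketch on a toy table; the `toy_` names and the strain names `s1`...`s5` are placeholders, not values from the real inputs.

```python
import pandas as pd

# Toy stand-in for strains_df; strain names are placeholders.
toy_strains_df = pd.DataFrame({
    'name': ['s1', 's2', 's3', 's4', 's5'],
    'haplotype': ['G.2', 'G.2', 'G.2.1', 'G.1.1.2', 'G.2.1'],
    'method': ['two_year', 'six_month', 'two_year', 'two_year', 'six_month'],
})

# Count how many times each haplotype label occurs across both trees.
counts = toy_strains_df['haplotype'].value_counts()

# Labels seen once are unique to one tree; labels seen more than once overlap.
toy_unique = sorted(counts[counts == 1].index)
toy_overlapping = sorted(counts[counts > 1].index)

print('unique:', toy_unique)
print('overlapping:', toy_overlapping)
```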


In [4]:
# Subselect original dataframe on overlapping haplotypes
overlapping_haplotypes_df = (strains_df[strains_df['haplotype']
    .isin(overlapping_haplotypes)]
    .query('method == "six_month"'))
# Save list of overlapping haplotypes
file = '6m_2y_overlapping_haplotypes.csv'
overlapping_haplotypes_df.to_csv(os.path.join(resultsdir, file))
print('Here are the overlapping haplotypes:')
print(overlapping_haplotypes_df['haplotype'].tolist())

# Subselect original dataframe on unique haplotypes
unique_haplotypes_df = (strains_df[strains_df['haplotype']
    .isin(unique_haplotypes)])


# Add overlapping and unique dataframes back together
auspice_strains_df = pd.concat([overlapping_haplotypes_df, unique_haplotypes_df]).reset_index(drop=True)
print('\nNow we can compare the number of strains from 6-month and 2-year, pre and post filtering.')
print('We should have kept all 6 month strains, and only a subselection of 2-year, if there was overlap.')
print('\nHere is the input:')
print('initial 6 month strains = ', len(strains_df.query('method == "six_month"')))
print('initial 2 year strains = ', len(strains_df.query('method == "two_year"')))
print('\nAfter filtering:')
print('the 6 month strains remaining = ', len(auspice_strains_df.query('method == "six_month"')))
print('the 2 year strains remaining = ', len(auspice_strains_df.query('method == "two_year"')))
print('therefore, the number of unique [Phylogenetic method] strains = ', 
      len(auspice_strains_df.query('method == "six_month"')) 
      + len(auspice_strains_df.query('method == "two_year"')))

auspice_strains_df

Here are the overlapping haplotypes:
['G.1.1.2', 'G.1.1.2:326R', 'G.1.3.1.1', 'G.1.3.1.1:122D', 'G.1.3.1.1:122D,144N,276E', 'G.1.3.1.1:122D,276E', 'G.1.3.1.1:124N,145N', 'G.1.3.1.1:145N', 'G.1.3.1.1:173R,276E', 'G.1.3.1.1:21T', 'G.1.3.1.1:242M', 'G.1.3.1.1:25V', 'G.1.3.2', 'G.2', 'G.2.1', 'G.2:242M']

Now we can compare the number of strains from 6-month and 2-year, pre and post filtering.
We should have kept all 6 month strains, and only a subselection of 2-year, if there was overlap.

Here is the input:
initial 6 month strains =  31
initial 2 year strains =  68

After filtering:
the 6 month strains remaining =  31
the 2 year strains remaining =  52
therefore, the number of unique [Phylogenetic method] strains =  83


Unnamed: 0,name,div,num_date,clade_membership,subclade,haplotype,method
0,A/Finland/391/2023,0.033588,2023.513699,2a.1b,G.1.1.2,G.1.1.2,six_month
1,A/Kanagawa/IC2239/2023,0.034155,2023.486301,2a.1b,G.1.1.2,G.1.1.2:326R,six_month
2,A/Singapore/NUH0526/2023,0.038133,2023.450685,2a.3a.1,G.1.3.1.1,G.1.3.1.1,six_month
3,A/Sydney/332/2023,0.039833,2023.450685,2a.3a.1,G.1.3.1.1,G.1.3.1.1:122D,six_month
4,A/Bangkok/P3755/2023,0.040966,2023.683562,2a.3a.1,G.1.3.1.1,"G.1.3.1.1:122D,144N,276E",six_month
...,...,...,...,...,...,...,...
78,A/Netherlands/1685/2023,0.040403,2023.686301,2a.3a.1,G.1.3.1.1,G.1.3.1.1:5E,six_month
79,A/SouthAfrica/R06477/2023,0.039833,2023.417808,2a.3a.1,G.1.3.1.1,G.1.3.1.1:5V,six_month
80,A/Oman/3011/2023,0.039266,2023.543836,2a.3a.1,G.1.3.1.1,G.1.3.1.1:78D,six_month
81,A/California/81/2023,0.039266,2023.628767,2a.3a.1,G.1.3.1.1,G.1.3.1.1:81D,six_month
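
The "keep 6m, drop 2y on overlap" rule can also be written as a sort followed by `drop_duplicates`. The sketch below uses a toy table with placeholder strain names, and it matches the cell above only under the assumption that each haplotype appears at most once per tree.

```python
import pandas as pd

# Toy stand-in for strains_df; strain names are placeholders.
toy_strains_df = pd.DataFrame({
    'name': ['s1', 's2', 's3', 's4'],
    'haplotype': ['G.2', 'G.2', 'G.2.1', 'G.1.1.2'],
    'method': ['two_year', 'six_month', 'six_month', 'two_year'],
})

# Rank the trees so that six_month rows sort ahead of two_year rows.
priority = {'six_month': 0, 'two_year': 1}

toy_filtered = (toy_strains_df
                .assign(tree_rank=lambda x: x['method'].map(priority))
                .sort_values(['haplotype', 'tree_rank'])
                # Keep the first (highest-priority) row per haplotype.
                .drop_duplicates(subset='haplotype', keep='first')
                .drop(columns='tree_rank')
                .reset_index(drop=True))

print(toy_filtered)
```

Here `G.2` is present in both trees, so only its six-month strain survives, while haplotypes found in a single tree pass through unchanged.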


In [5]:
# [NextClade method]
# My input is a single TSV, generated with help from John Huddleston (Bedford lab).
# All outputs were generated from a metadata file generated on November 21st by John.

metafile = 'representative_strains_per_HA1_haplotype_nextclade.tsv'

nextclade_strains_df = pd.DataFrame()
# Import emerging clades output
df = pd.read_csv(os.path.join(datadir,'nextclade_haplotypes', metafile), sep='\t')
# Keep only the metadata columns used downstream
nextclade_strains_df = pd.concat([nextclade_strains_df, df[['seqName', 'short-clade', 'subclade',
                                                            'aaSubstitutions', 'HA1_substitutions', 
                                                            'accession_ha', 'date', 'date_submitted', 
                                                            'region', 'country', 'division', 'location', 
                                                            'passage_category', 'originating_lab', 
                                                            'submitting_lab', 'HA1_haplotype_count']]])


nextclade_strains_df.head()

Unnamed: 0,seqName,short-clade,subclade,aaSubstitutions,HA1_substitutions,accession_ha,date,date_submitted,region,country,division,location,passage_category,originating_lab,submitting_lab,HA1_haplotype_count
0,A/Portugal/210259/2022,2a.1,G.1.1,"HA1:D104G,HA1:K276R","HA1:D104G,HA1:K276R",EPI2139262,2022-02-17,2022-08-31,Europe,Portugal,Portugal,Portugal,cell,Instituto Nacional De Saude (insa),Crick Worldwide Influenza Centre,4423
1,A/Ontario/12/2022,2a.1b,G.1.1.2,"HA1:D104G,HA1:I140K,HA1:K276R,HA1:R299K","HA1:D104G,HA1:I140K,HA1:K276R,HA1:R299K",EPI2337996,2022-11-09,2023-02-01,North America,Canada,Ontario,Ontario,unpassaged,Public Health Ontario,B.c. Centre For Disease Control,2708
2,A/Oklahoma/13632/2022,2b,G.2,"HA1:E50K,HA1:G53D,HA1:F79V,HA1:I140K,HA1:S156H","HA1:E50K,HA1:G53D,HA1:F79V,HA1:I140K,HA1:S156H",EPI2290282,2022-11-17,2023-01-12,North America,Usa,Oklahoma,Oklahoma,unpassaged,U.s. Air Force School Of Aerospace Medicine,U.s. Air Force School Of Aerospace Medicine,2124
3,A/Madrid/230625/2022,2b,G.2.2,"HA1:R33Q,HA1:E50K,HA1:G53D,HA1:F79V,HA1:I140K,...","HA1:R33Q,HA1:E50K,HA1:G53D,HA1:F79V,HA1:I140K,...",EPI2478169,2022-12-14,2023-03-27,Europe,Spain,Madrid,Madrid,unpassaged,Instituto De Salud Carlos Iii,Instituto De Salud Carlos Iii,1882
4,A/Roraima/2022-020834-IEC/2022,2a.3,G.1.3,"HA1:G53N,HA1:N96S,HA1:I192F,HA2:N49S,HA2:N160T","HA1:G53N,HA1:N96S,HA1:I192F",EPI2438576,2022-09-27,2023-03-03,South America,Brazil,Roraima,Roraima,unpassaged,Evandro Chagas Institute,Evandro Chagas Institute,1642


In [6]:
# Merge haplotype years and counts with metadata
metadata_file = 'nextclade_metadata_h3n2_2023-11-21.tsv'
metadata = pd.read_csv(os.path.join(datadir, metadata_file), sep='\t')

# Rename columns and add year
metadata = (metadata.assign(year = lambda x: x['date'].str.split('-').str[0].astype(int))
           .rename(columns={'name': 'seqName'})
           )

# Merge counts
counted_metadata = metadata.merge(nextclade_strains_df[['HA1_substitutions', 'HA1_haplotype_count']])
print('here is an array of unique haplotype counts in 2022-2023...')
print(counted_metadata['HA1_haplotype_count'].sort_values(ascending=False).unique())
print('and here is a sample of the metadata file with counts...')
counted_metadata.head()

here is an array of unique haplotype counts in 2022-2023...
[4423 2708 2124 1882 1642 1626 1503  929  665  442  396  355  334  290
  281  274  255  247  208  166  164  158  140  139  138  120  102   92
   91   90   88   87   85   84   82   81   80   79   70   67   63   58
   57   56   53   52   51   50   47   45   44   43   42   40   38   37
   36   35   34   33   32   31   30   29   28   27   26   25   24   23
   22   21   20   19   18   17   16   15   14   13   12   11   10    9
    8    7    6    5    4    3    2    1]
and here is a sample of the metadata file with counts...


Unnamed: 0,seqName,short-clade,subclade,aaSubstitutions,HA1_substitutions,accession_ha,date,date_submitted,region,country,division,location,passage_category,originating_lab,submitting_lab,age,gender,accession_na,year,HA1_haplotype_count
0,A/Catalonia/NSVH198262131/2022,2a.3a.1,H,"HA1:E50K,HA1:G53N,HA1:N96S,HA1:I140K,HA1:I192F...","HA1:E50K,HA1:G53N,HA1:N96S,HA1:I140K,HA1:I192F...",EPI2260033,2022-12-15,2022-12-23,Europe,Spain,Catalonia,Catalonia,unpassaged,Hospital Universitari Germans Trias I Pujol,Hospital Universitari Vall D'hebron,?,?,EPI2260032,2022,1626
1,A/Portugal/210259/2022,2a.1,G.1.1,"HA1:D104G,HA1:K276R","HA1:D104G,HA1:K276R",EPI2139262,2022-02-17,2022-08-31,Europe,Portugal,Portugal,Portugal,cell,Instituto Nacional De Saude (insa),Crick Worldwide Influenza Centre,48y,female,EPI2139347,2022,4423
2,A/NewJersey/13618/2022,2b,G.2.1,"HA1:E50K,HA1:G53D,HA1:F79V,HA1:T135A,HA1:I140K...","HA1:E50K,HA1:G53D,HA1:F79V,HA1:T135A,HA1:I140K...",EPI2290191,2022-11-10,2023-01-12,North America,Usa,New Jersey,New Jersey,unpassaged,U.s. Air Force School Of Aerospace Medicine,U.s. Air Force School Of Aerospace Medicine,34y,male,EPI2290190,2022,1503
3,A/Oklahoma/13632/2022,2b,G.2,"HA1:E50K,HA1:G53D,HA1:F79V,HA1:I140K,HA1:S156H","HA1:E50K,HA1:G53D,HA1:F79V,HA1:I140K,HA1:S156H",EPI2290282,2022-11-17,2023-01-12,North America,Usa,Oklahoma,Oklahoma,unpassaged,U.s. Air Force School Of Aerospace Medicine,U.s. Air Force School Of Aerospace Medicine,30y,male,EPI2290281,2022,2124
4,A/Brazil/BA-LACEN-BA113-292060529/2022,2a.3,G.1.3,"HA1:G53N,HA1:N96S,HA1:Y100C,HA1:I192F,HA2:N49S...","HA1:G53N,HA1:N96S,HA1:Y100C,HA1:I192F",EPI2185964,2022-01-06,2022-09-27,South America,Brazil,Brazil,Brazil,undetermined,Laboratório Central De Saúde Pública Professor...,Laborat?rio Central De Sa?de Publica Do Estado...,?,?,EPI2185963,2022,1
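
The merge above relies on pandas inferring the shared column. A slightly more defensive sketch, shown here on toy data, names the key explicitly and uses `validate` to confirm that each haplotype has a single count row. The toy tables and their values are illustrative assumptions only.

```python
import pandas as pd

# Toy per-sequence metadata (many rows per haplotype).
toy_metadata = pd.DataFrame({
    'seqName': ['s1', 's2', 's3'],
    'HA1_substitutions': ['HA1:D104G,HA1:K276R',
                          'HA1:D104G,HA1:K276R',
                          'HA1:E50K'],
})

# Toy per-haplotype counts (one row per haplotype).
toy_counts = pd.DataFrame({
    'HA1_substitutions': ['HA1:D104G,HA1:K276R', 'HA1:E50K'],
    'HA1_haplotype_count': [2, 1],
})

# validate='many_to_one' raises MergeError if a haplotype has duplicate count rows.
toy_counted = toy_metadata.merge(
    toy_counts,
    on='HA1_substitutions',
    how='inner',
    validate='many_to_one',
)

print(toy_counted)
```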


In [7]:
# Keep only records collected in 2023
print('number of 2022 records = ', len(counted_metadata.query('year == 2022')))
print('number of 2023 records = ',len(counted_metadata.query('year == 2023')))

recent_counted_metadata = (counted_metadata
                           .query('year == 2023')
                           .reset_index(drop=True)
                           .rename(columns={'seqName': 'name'})
                          )

print('here are the sampled clades :', counted_metadata['short-clade'].unique())
print('and here is an array of unique haplotype counts only in 2023...')
print(recent_counted_metadata['HA1_haplotype_count'].sort_values(ascending=False).unique())

number of 2022 records =  28790
number of 2023 records =  5552
here are the sampled clades : ['2a.3a.1' '2a.1' '2b' '2a.3' '1a.1' '2a.1a' '2a.1b' '2a.3a' '2a.2' '2a'
 '3C.2a1b.1a' '2c' '3C.2a1b.1b' '2a.3b' '2' '1' '3C.3a']
and here is an array of unique haplotype counts only in 2023...
[4423 2708 2124 1882 1642 1626 1503  665  442  396  355  334  281  274
  255  247  208  164  158  139  138  102   91   90   88   82   81   80
   79   70   63   58   53   51   50   47   44   43   40   36   33   31
   30   29   28   27   26   25   24   23   22   21   20   19   18   17
   16   15   14   13   12   11   10    9    8    7    6    5    4    3
    2    1]


In [8]:
# Now merge with auspice 
auspice_nextclade_merged_df = (auspice_strains_df
                               .drop(columns = ['subclade'])
    .merge(recent_counted_metadata[['name', 'subclade', 'date', 'date_submitted', 'HA1_substitutions', 'accession_ha', 'HA1_haplotype_count']], how='left', suffixes = ('_auspice2y', '_nextclade'), on = ['name'])
    .dropna()
    .reset_index(drop=True)
    )
# Metadata contains a weird typo where 'EPI' in HA accession is sometimes repeated
auspice_nextclade_merged_df['accession_ha'] = auspice_nextclade_merged_df['accession_ha'].str.replace('EPIEPI','EPI')

# Write out results
outfile = os.path.join(resultsdir, 'auspice_haplotypes_with_nextclade_counts.csv')
print(f'writing merged metadata to {outfile}...')
auspice_nextclade_merged_df.to_csv(outfile, index=False)
auspice_nextclade_merged_df
# ['name', 'HA1_substitutions']

writing merged metadata to ../results/auspice_haplotypes_with_nextclade_counts.csv...


Unnamed: 0,name,div,num_date,clade_membership,haplotype,method,subclade,date,date_submitted,HA1_substitutions,accession_ha,HA1_haplotype_count
0,A/Finland/391/2023,0.033588,2023.513699,2a.1b,G.1.1.2,six_month,G.1.1.2,2023-07-07,2023-09-08,"HA1:D104G,HA1:I140K,HA1:K276R,HA1:R299K",EPI2736643,2708.0
1,A/Kanagawa/IC2239/2023,0.034155,2023.486301,2a.1b,G.1.1.2:326R,six_month,G.1.1.2,2023-06-27,2023-07-25,"HA1:D104G,HA1:I140K,HA1:K276R,HA1:R299K,HA1:K326R",EPI2647253,11.0
2,A/Singapore/NUH0526/2023,0.038133,2023.450685,2a.3a.1,G.1.3.1.1,six_month,H,2023-06-14,2023-07-10,"HA1:E50K,HA1:G53N,HA1:N96S,HA1:I140K,HA1:I192F...",EPI2617594,1626.0
3,A/Sydney/332/2023,0.039833,2023.450685,2a.3a.1,G.1.3.1.1:122D,six_month,H,2023-06-14,2023-07-24,"HA1:E50K,HA1:G53N,HA1:N96S,HA1:N122D,HA1:I140K...",EPI2643073,274.0
4,A/Bangkok/P3755/2023,0.040966,2023.683562,2a.3a.1,"G.1.3.1.1:122D,144N,276E",six_month,H,2023-09-07,2023-09-28,"HA1:E50K,HA1:G53N,HA1:N96S,HA1:N122D,HA1:I140K...",EPI2760070,10.0
5,A/Ontario/RV00796/2023,0.0404,2023.563014,2a.3a.1,"G.1.3.1.1:122D,276E",six_month,H,2023-07-25,2023-09-07,"HA1:E50K,HA1:G53N,HA1:N96S,HA1:N122D,HA1:I140K...",EPI2735889,70.0
6,A/Krabi/THIS050/2023,0.04267,2023.645205,2a.3a.1,"G.1.3.1.1:124N,145N",six_month,H,2023-08-24,2023-09-28,"HA1:E50K,HA1:G53N,HA1:N96S,HA1:S124N,HA1:I140K...",EPI2759946,20.0
7,A/Saskatchewan/SKFLU317847/2023,0.039833,2023.617808,2a.3a.1,G.1.3.1.1:145N,six_month,H,2023-08-14,2023-09-11,"HA1:T28A,HA1:E50K,HA1:G53N,HA1:N96S,HA1:I140K,...",EPI2743304,1.0
8,A/StPetersburg/MH144113/2023,0.042105,2023.626027,2a.3a.1,"G.1.3.1.1:173R,276E",six_month,H,2023-08-17,2023-10-24,"HA1:E50K,HA1:G53N,HA1:N96S,HA1:I140K,HA1:Q173R...",EPI2777879,28.0
9,A/Maldives/852/2023,0.039266,2023.420548,2a.3a.1,G.1.3.1.1:21T,six_month,H,2023-06-03,2023-08-17,"HA1:P21T,HA1:E50K,HA1:G53N,HA1:N96S,HA1:I140K,...",EPI2689677,18.0
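
Because the left merge is followed by `.dropna()`, strains without a 2023 Nextclade record disappear silently. One way to see which strains those are, sketched here on toy data with placeholder names, is the `indicator` option of `merge`. Note that this filters only on match status, whereas `.dropna()` also removes rows with a missing value in any other column.

```python
import pandas as pd

# Toy stand-ins for auspice_strains_df and recent_counted_metadata.
toy_auspice = pd.DataFrame({'name': ['s1', 's2', 's3'],
                            'haplotype': ['G.2', 'G.2.1', 'G.1.1.2']})
toy_recent = pd.DataFrame({'name': ['s1', 's3'],
                           'HA1_haplotype_count': [2124, 2708]})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge.
toy_merged = toy_auspice.merge(toy_recent, on='name', how='left', indicator=True)

# Strains present in the tree output but absent from the 2023 metadata.
toy_unmatched = toy_merged.loc[toy_merged['_merge'] == 'left_only', 'name'].tolist()
print('strains with no 2023 record:', toy_unmatched)

# Keep only matched strains, then discard the helper column.
toy_kept = (toy_merged[toy_merged['_merge'] == 'both']
            .drop(columns='_merge')
            .reset_index(drop=True))
print(toy_kept)
```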


## Adding new HA1 substitutions
The above method only found 37 haplotypes. We have more room in this library and want to include new mutations.

To do that, I'll identify all HA1 substitutions sampled within the last 6 months that have a 2022-2023 record count of at least 5. 
I'll compare these haplotypes to my auspice-nextclade method, detailed above.
I suspect I'll have overlap, but also identify new HA1 substitutions. 

To pick representative strains for any new HA1 substitutions, I'll choose the most recent sampled strain. 
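
The "most recent strain per haplotype" rule can be sketched with a sort and a grouped `tail(1)`. The toy table below is an assumption for illustration, and unlike the loop in the next cell, this version keeps exactly one row when several strains share the latest date.

```python
import pandas as pd

# Toy records: two haplotypes, two strains each (placeholder names).
toy_records = pd.DataFrame({
    'name': ['s1', 's2', 's3', 's4'],
    'HA1_substitutions': ['HA1:E50K', 'HA1:E50K', 'HA1:G53N', 'HA1:G53N'],
    'date': ['2023-06-14', '2023-09-07', '2023-08-24', '2023-07-25'],
})

toy_representatives = (toy_records
                       .assign(date=lambda x: pd.to_datetime(x['date']))
                       # Oldest first, so the last row per group is the newest.
                       .sort_values('date')
                       .groupby('HA1_substitutions')
                       .tail(1)
                       .sort_values('HA1_substitutions')
                       .reset_index(drop=True))

print(toy_representatives)
```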

In [9]:
# Build a sortable YYYYMMDD string from the collection date
counted_metadata = counted_metadata.assign(year_month_day = lambda x: x['date'].str.split('-').str[0:3])
counted_metadata['yearmonthday'] = [''.join(map(str, l)) for l in counted_metadata['year_month_day']]
counted_metadata = counted_metadata.drop(columns = ['year_month_day'])

# Count uncertain dates
print('some of the month dates are input as "X", so we will have to handle those differently')
count = len(counted_metadata.query('yearmonthday == "2023XXXX"').HA1_substitutions.unique().tolist())
print(f'here are the {count} unsure 2023 HA1 substitutions:')
print(counted_metadata.query('yearmonthday == "2023XXXX"').HA1_substitutions.unique())


# ID uncertain 2023 metadata and handle separately 
uncertain_2023_metadata = counted_metadata.query('yearmonthday == "2023XXXX"')
print('\nhere are submitted dates for uncertain 2023 sequences')
print(uncertain_2023_metadata.date_submitted.unique().tolist())
uncertain_2023_metadata = uncertain_2023_metadata.assign(
    month = lambda x: x['date_submitted'].str.split('-').str[1].astype(int)
)

print('\nthe number of uncertain 2023 HA1 haplotypes SUBMITTED in past 6 months...')
uncertain_past_six = (uncertain_2023_metadata[uncertain_2023_metadata.month >= 6])
print(len(uncertain_past_six['HA1_substitutions'].unique().tolist()))

print('\nthe number of uncertain 2023 HA1 haplotypes SUBMITTED in past 6 months with >= 5 counts...')
uncertain_greater_that_five = uncertain_past_six[uncertain_past_six.HA1_haplotype_count >= 5]
print(len(uncertain_greater_that_five['HA1_substitutions'].unique().tolist()))


# Subtract uncertain 2022 and 2023 dates from counted_metadata
num_counted_metadata = counted_metadata[pd.to_numeric(counted_metadata['yearmonthday'], errors='coerce').notnull()]
num_counted_metadata = num_counted_metadata.assign(
    yearmonthday = lambda x: x['yearmonthday'].astype(int)
)

print('\nthe number of HA1 haplotypes in past 6 months...')
past_six = num_counted_metadata[num_counted_metadata.yearmonthday >= 20230600]
print(len(past_six['HA1_substitutions'].unique().tolist()))

print('\nthe number of HA1 haplotypes in past 6 months with >= 5 counts...')
greater_that_five = past_six[past_six.HA1_haplotype_count >= 5]
print(len(greater_that_five['HA1_substitutions'].unique().tolist()))

some of the month dates are input as "X", so we will have to handle those differently
here are the 15 unsure 2023 HA1 substitutions:
['HA1:I25V,HA1:E50K,HA1:G53N,HA1:N96S,HA1:I140K,HA1:R150K,HA1:I192F,HA1:I223V'
 'HA1:E50K,HA1:G53N,HA1:N96S,HA1:I140K,HA1:I192F,HA1:I223V'
 'HA1:R33Q,HA1:E50K,HA1:G53D,HA1:F79V,HA1:I140K,HA1:S156H,HA1:S262N'
 'HA1:E50K,HA1:G53D,HA1:F79V,HA1:T135A,HA1:I140K,HA1:S156H,HA1:S262N'
 'HA1:E50K,HA1:G53N,HA1:N96S,HA1:I140K,HA1:I192F,HA1:I223V,HA1:I242M'
 'HA1:E50K,HA1:G53D,HA1:F79V,HA1:I140K,HA1:S156H'
 'HA1:I25V,HA1:E50K,HA1:G53N,HA1:N96S,HA1:I140K,HA1:I192F,HA1:K207E,HA1:I223V'
 'HA1:I25V,HA1:E50K,HA1:G53N,HA1:N96S,HA1:I140K,HA1:I192F,HA1:I223V'
 'HA1:E50K,HA1:G53N,HA1:N96S,HA1:N122D,HA1:I140K,HA1:I192F,HA1:I223V,HA1:K276E'
 'HA1:E50K,HA1:G53N,HA1:N96R,HA1:I140K,HA1:I192F,HA1:I223V'
 'HA1:I25V,HA1:E50K,HA1:G53N,HA1:N96S,HA1:I140K,HA1:I192F,HA1:I223V,HA1:D271E'
 'HA1:E50K,HA1:G53N,HA1:N96S,HA1:A106V,HA1:I140K,HA1:I192F,HA1:I223V'
 'HA1:E50K,HA1:G53N,HA1:N96S,HA1
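
An alternative way to separate exact from ambiguous collection dates is to let `pd.to_datetime` coerce unparseable values to `NaT`. This is a sketch on toy data, not the method used above. The 2023-06-01 cutoff is my reading of the `>= 20230600` filter, and the `X`-padded dates mimic the format reported in the output.

```python
import pandas as pd

# Toy dates: one exact, two with unknown month and/or day.
toy_dates = pd.DataFrame({
    'seqName': ['s1', 's2', 's3'],
    'date': ['2023-07-07', '2023-XX-XX', '2023-06-XX'],
})

# Dates that do not match YYYY-MM-DD become NaT instead of raising.
toy_dates['parsed_date'] = pd.to_datetime(
    toy_dates['date'], format='%Y-%m-%d', errors='coerce')
toy_dates['date_is_exact'] = toy_dates['parsed_date'].notna()

# Comparisons against NaT are False, so ambiguous rows drop out here
# and would need to be handled separately, as in the cell above.
cutoff = pd.Timestamp('2023-06-01')
toy_recent_exact = toy_dates[toy_dates['parsed_date'] >= cutoff]

print(toy_dates)
print(toy_recent_exact)
```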

In [10]:
print('here is a preview of the total unique haplotypes from the combined "Auspice" and "New HA1 sub" methods')
(pd.concat([greater_that_five.rename(columns={'seqName': 'name'}),
            uncertain_greater_that_five.rename(columns={'seqName': 'name'}),
            auspice_nextclade_merged_df])
 [['HA1_substitutions', 'HA1_haplotype_count']]
 .drop_duplicates()
 .reset_index(drop=True)
)

here is a preview of the total unique haplotypes from the combined "Auspice" and "New HA1 sub" methods


Unnamed: 0,HA1_substitutions,HA1_haplotype_count
0,"HA1:E50K,HA1:G53N,HA1:N96S,HA1:N122D,HA1:I140K...",70.0
1,"HA1:E50K,HA1:G53N,HA1:N96S,HA1:N122D,HA1:I140K...",274.0
2,"HA1:E50K,HA1:G53N,HA1:G78D,HA1:N96S,HA1:I140K,...",28.0
3,"HA1:E50K,HA1:G53N,HA1:N96S,HA1:I140K,HA1:S145N...",23.0
4,"HA1:E50K,HA1:G53N,HA1:N96S,HA1:N122D,HA1:I140K...",10.0
...,...,...
57,"HA1:I48T,HA1:G53N,HA1:N96S,HA1:Q173R,HA1:I192F",5.0
58,"HA1:E50K,HA1:G53D,HA1:F79V,HA1:T135A,HA1:I140K...",28.0
59,"HA1:E50K,HA1:G53D,HA1:F79V,HA1:D101E,HA1:I140K...",247.0
60,"HA1:E50K,HA1:G53N,HA1:N96S,HA1:I140K,HA1:I192F...",1.0


In [11]:
# List of auspice-method HA1 mutations already represented in library
auspice_HA1_muts = auspice_nextclade_merged_df['HA1_substitutions'].tolist()

# Concat uncertain-2023 and 2023 dataframes
all_HA1_substitutions = (pd.concat([greater_that_five.rename(columns={'seqName': 'name'}),
                                    uncertain_greater_that_five.rename(columns={'seqName': 'name'})])
                         .drop_duplicates()
                         .drop(columns=['month'])
                         .reset_index(drop=True)
                        )

new_subs_df = all_HA1_substitutions[~all_HA1_substitutions.HA1_substitutions.isin(auspice_HA1_muts)]

# ID new unique HA1 substitutions
new_HA_list = new_subs_df.HA1_substitutions.unique().tolist()
print(f'this method identifies {len(new_HA_list)} new HA1 substitutions...')

# Save representative strains with most recent collection date 
new_HA_strains_df = pd.DataFrame()

for ha in new_HA_list:
    df = new_subs_df.query(f'HA1_substitutions == "{ha}"')
    df = df[pd.to_numeric(df['yearmonthday'], errors='coerce').notnull()]

    # Identify most recent collection date
    latest_date = (df.yearmonthday.max())

    if ha == 'HA1:D7G,HA1:E50K,HA1:G53D,HA1:F79V,HA1:I140K,HA1:S156H':
        print(ha, df)

    # Select strain with most recent date
    representative_strain_df = df.query(f'yearmonthday == {latest_date}').dropna()
    
    if len(representative_strain_df) != 1:
        print(f'expected exactly one strain collected on {latest_date}, found {len(representative_strain_df)}!')
        
    new_HA_strains_df = pd.concat([new_HA_strains_df, representative_strain_df])
    
print('here are some the representative strains for each substitution, chosen by latest collection date...')
new_HA_strains_df = (new_HA_strains_df
                     .reset_index(drop=True)
                     .assign(method = 'newHAsubs')
                    )
new_HA_strains_df.name.unique().tolist()

this method identifies 25 new HA1 substitutions...
here are some the representative strains for each substitution, chosen by latest collection date...


['A/Sydney/710/2023',
 'A/Netherlands/1760/2023',
 'A/SouthAfrica/KO56863/2023',
 'A/SouthAfrica/R06240/2023',
 'A/Finland/399/2023',
 'A/SouthAfrica/PET28931/2023',
 'A/SouthAfrica/R06359/2023',
 'A/SouthAfrica/R06506/2023',
 'A/Sendai/45/2023',
 'A/Townsville/68/2023',
 'A/Bangkok/P3599/2023',
 'A/SouthAfrica/K056872/2023',
 'A/Okinawa/234/2023',
 'A/Guangdong-Futian/1980/2023',
 'A/SouthSudan/642/2023',
 'A/Ehime/50/2023',
 'A/Chipata/15-NIC-007/2023',
 'A/SchleswigHolstein/4/2023',
 'A/Catalonia/NSVH102124476/2023',
 'A/AbuDhabi/6753/2023',
 'A/Kanagawa/AC2316/2023',
 'A/Brisbane/429/2023',
 'A/SouthAfrica/R07073/2023',
 'A/SouthSudan/631/2023',
 'A/Southafrica/R07876/2020']

In [12]:
# Combine everything
auspice_and_newHAsubs_df = (pd.concat([auspice_nextclade_merged_df, new_HA_strains_df])
                            .reset_index(drop=True)
                            [['name', 'subclade', 'method', 'date', 'date_submitted', 
                              'accession_ha','HA1_haplotype_count', 'HA1_substitutions']]
                           )
auspice_and_newHAsubs_df['accession_ha'] = auspice_and_newHAsubs_df['accession_ha'].str.replace('EPIEPI','EPI')


outfile = os.path.join(resultsdir, 'auspice_and_newHAsubs_library_strains.csv')
print(f'writing merged metadata to {outfile}...')
auspice_and_newHAsubs_df.to_csv(outfile, index=False)
auspice_and_newHAsubs_df

writing merged metadata to ../results/auspice_and_newHAsubs_library_strains.csv...


Unnamed: 0,name,subclade,method,date,date_submitted,accession_ha,HA1_haplotype_count,HA1_substitutions
0,A/Finland/391/2023,G.1.1.2,six_month,2023-07-07,2023-09-08,EPI2736643,2708.0,"HA1:D104G,HA1:I140K,HA1:K276R,HA1:R299K"
1,A/Kanagawa/IC2239/2023,G.1.1.2,six_month,2023-06-27,2023-07-25,EPI2647253,11.0,"HA1:D104G,HA1:I140K,HA1:K276R,HA1:R299K,HA1:K326R"
2,A/Singapore/NUH0526/2023,H,six_month,2023-06-14,2023-07-10,EPI2617594,1626.0,"HA1:E50K,HA1:G53N,HA1:N96S,HA1:I140K,HA1:I192F..."
3,A/Sydney/332/2023,H,six_month,2023-06-14,2023-07-24,EPI2643073,274.0,"HA1:E50K,HA1:G53N,HA1:N96S,HA1:N122D,HA1:I140K..."
4,A/Bangkok/P3755/2023,H,six_month,2023-09-07,2023-09-28,EPI2760070,10.0,"HA1:E50K,HA1:G53N,HA1:N96S,HA1:N122D,HA1:I140K..."
...,...,...,...,...,...,...,...,...
57,A/Kanagawa/AC2316/2023,G.1.1.2,newHAsubs,2023-10-08,2023-11-01,EPI2781715,10.0,"HA1:D104G,HA1:S124R,HA1:I140K,HA1:K276R,HA1:R299K"
58,A/Brisbane/429/2023,G.1.1.2,newHAsubs,2023-07-13,2023-11-09,EPI2790312,5.0,"HA1:D104G,HA1:I140K,HA1:I214T,HA1:K276R,HA1:R299K"
59,A/SouthAfrica/R07073/2023,G.2,newHAsubs,2023-06-19,2023-09-04,EPI2725091,5.0,"HA1:E50K,HA1:G53N,HA1:F79V,HA1:I140K,HA1:S156H..."
60,A/SouthSudan/631/2023,G.1.3,newHAsubs,2023-06-09,2023-10-03,EPI2760859,8.0,"HA1:G53N,HA1:N96S,HA1:I192F,HA1:K276E"


## Write GISAID IDs to file
The search function in GISAID accepts tab-delimited EPI IDs. Save a file so that these IDs can be copy-pasted into the search box more easily. 
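
If a single tab-delimited line is the more convenient form to paste, the IDs could be joined with tabs before writing. This is a minimal sketch under that assumption. It writes to a temporary directory and a hypothetical `.tsv` filename rather than to the notebook's own output path, and it uses three accessions from this library as sample input.

```python
import os
import tempfile

# Three accessions taken from the library list, used here as sample input.
toy_accessions = ['EPI2736643', 'EPI2647253', 'EPI2617594']

# Hypothetical output location: a scratch directory, not the project tree.
toy_outfile = os.path.join(tempfile.mkdtemp(), 'library_accession_numbers.tsv')

# Write all IDs on one line, separated by tabs.
with open(toy_outfile, 'w') as f:
    f.write('\t'.join(toy_accessions) + '\n')

with open(toy_outfile) as f:
    print(repr(f.read()))
```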

In [13]:
# Make out directory
gisaiddir = os.path.join(datadir, 'gisaid_query')
os.makedirs(gisaiddir, exist_ok=True)

# Save just EPI IDs to CSV to make GISAID searching easy
outfile = os.path.join(gisaiddir, 'library_accession_numbers.csv')
library_strain_accession = auspice_and_newHAsubs_df['accession_ha']
print(f'the number of [Auspice/Nextclade + New HA1 subs methods] unique accessions is... {len(library_strain_accession.drop_duplicates())}')
print('here they are: ', library_strain_accession.to_list())
print(f'writing accession numbers to {outfile}...')
with open(outfile, 'w') as f:
    for item in library_strain_accession:
        f.write(item + '\n')

the number of [Auspice/Nextclade + New HA1 subs methods] unique accessions is... 62
here they are:  ['EPI2736643', 'EPI2647253', 'EPI2617594', 'EPI2643073', 'EPI2760070', 'EPI2735889', 'EPI2759946', 'EPI2743304', 'EPI2777879', 'EPI2689677', 'EPI2756864', 'EPI2781036', 'EPI2778022', 'EPI2735103', 'EPI2758716', 'EPI2791811', 'EPI2498763', 'EPI2505314', 'EPI2571554', 'EPI2558520', 'EPI2498814', 'EPI2312127', 'EPI2793114', 'EPI2724875', 'EPI2785896', 'EPI2771080', 'EPI2725146', 'EPI2759881', 'EPI2762684', 'EPI2773303', 'EPI2686112', 'EPI2687991', 'EPI2755805', 'EPI2688989', 'EPI2756778', 'EPI2772585', 'EPI2653147', 'EPI2724869', 'EPI2794043', 'EPI2718764', 'EPI2688666', 'EPI2736695', 'EPI2700761', 'EPI2688685', 'EPI2689110', 'EPI2775314', 'EPI2743993', 'EPI2760054', 'EPI2718582', 'EPI2670258', 'EPI2667823', 'EPI2762970', 'EPI2767381', 'EPI2763780', 'EPI2790753', 'EPI2725748', 'EPI2689865', 'EPI2781715', 'EPI2790312', 'EPI2725091', 'EPI2760859', 'EPI2732515']
writing accession numbers to ..

## Analyze HA1 mutations
It would be helpful to know how many mutations are sampled in our libraries, and in which backgrounds. 
I will calculate and save that information here.

In [14]:
# Analyze HA1 mutations
auspice_and_newHAsubs_df['HA1_muts'] = auspice_and_newHAsubs_df['HA1_substitutions'].str.replace('HA1:','')
HA_df = auspice_and_newHAsubs_df[['HA1_muts']]

# Make list of unique HA1 mutations
unique_mutations = []
for item in HA_df['HA1_muts'].to_list():
    for sub in item.split(','):
        if sub not in unique_mutations:
            unique_mutations.append(sub)
print('here are the unique mutations: ', unique_mutations)

# Count unique contexts for each mutation
mutation_contexts = []

for mut in unique_mutations:
    count = 0 
    backgrounds = []
    for item in HA_df['HA1_muts'].to_list():
              
        if mut in item.split(','):  # exact token match rather than substring match
            count += 1
            backgrounds.append(item)
        
    mutation_contexts.append([mut, count, backgrounds])
    
mutation_contexts_df = (pd.DataFrame(mutation_contexts, columns = ['mutation', 'count', 'backgrounds'])
                        .sort_values(by=['count'], ascending=False)
                        .reset_index(drop=True)
                        .assign(site = lambda x: x['mutation'].str[1:-1].astype(int))
                       )

outfile = os.path.join(resultsdir, 'mutation_counts_contexts.csv')
print(f'writing mutations to {outfile}...')
mutation_contexts_df.to_csv(outfile, index=False)

here are the unique mutations:  ['D104G', 'I140K', 'K276R', 'R299K', 'K326R', 'E50K', 'G53N', 'N96S', 'I192F', 'I223V', 'N122D', 'S144N', 'K276E', 'S124N', 'S145N', 'T28A', 'Q173R', 'P21T', 'I242M', 'I25V', 'I140M', 'G53D', 'F79V', 'S156H', 'G5E', 'T135A', 'S262N', 'V309I', 'E83K', 'I48T', 'R269K', 'D101E', 'D7G', 'D104N', 'A106V', 'Q327R', 'Q197H', 'I300V', 'S198P', 'K207R', 'K278E', 'G5V', 'G78D', 'N81D', 'Q173L', 'A212T', 'N63D', 'L15I', 'T10M', 'R33Q', 'G275S', 'G275D', 'T128I', 'V112I', 'K278N', 'P289L', 'K121E', 'S124R', 'I214T', 'P239S']
writing mutations to ../results/mutation_counts_contexts.csv...
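
The same per-mutation tally can be sketched without nested loops by splitting each haplotype into one row per mutation with `explode` and then grouping. The toy haplotypes below are illustrative, not the real library, and the tie-breaking sort on `site` is an added assumption.

```python
import pandas as pd

# Toy stand-in for the 'HA1_muts' column built in the cell above.
toy_library = pd.DataFrame({
    'HA1_muts': ['D104G,I140K', 'E50K,I140K,S145N', 'E50K,G53N'],
})

# One row per (haplotype, mutation) pair.
toy_long = (toy_library
            .assign(mutation=lambda x: x['HA1_muts'].str.split(','))
            .explode('mutation'))

# For each mutation: how many haplotypes carry it, and which ones.
grouped = toy_long.groupby('mutation')['HA1_muts']
toy_contexts = (pd.DataFrame({'count': grouped.size(),
                              'backgrounds': grouped.apply(list)})
                .reset_index()
                .assign(site=lambda x: x['mutation'].str[1:-1].astype(int))
                .sort_values(['count', 'site'], ascending=[False, True])
                .reset_index(drop=True))

print(toy_contexts)
```

Splitting on commas before counting means each mutation is matched as a whole token, so one label can never be counted inside another.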
