On the genbank/bioproject side, here are the to-dos I have (in roughly the order I suggest you tackle them):

go through an example genbank submission to familiarize yourself with the process
review our existing bioprojects (for assemblies) to see if we need to request changes
get a list of our existing bioprojects for assemblies and throw those up in github somewhere

create a list of bioprojects that will need to be created
all samples from Y2-4 will need new bioprojects, I believe
our genbank curator may do this for us. if she does not, we may need instructions for that too.

I seem to remember that creating bioprojects isn't too bad, but linking them can be a pill. That may not be true 
anymore, though.

come up with a list of questions/requests for our genbank curator; things like:
how do we upload assemblies for samples that already have assemblies?
do we need to update the type of bioproject so it doesn't show "pseudohaplotype" anywhere?

create instructions in confluence for uploading assemblies to genbank
update the figure above in LucidCharts (or whatever) and put that on our GitHub
this one is totally backburner, btw. I think it's a nice-to-have.

In [1]:
import re
import collections
import pandas as pd
from Bio import Entrez
import numpy as np
import xml.etree.ElementTree as ET


```
esearch -db bioproject -query PRJNA730822 | efetch -format runinfo -mode xml > PRJNA730822.xml
```

In [13]:
# Set your email (required by NCBI)
Entrez.email = "apblair@ucsc.edu"

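def get_bioproject_identifiers(bioproject_accession):
    """Return (sample IDs found in the project title, raw esummary record) for a BioProject accession."""
    # Resolve the accession (e.g. PRJNA...) to its numeric BioProject UID
    handle = Entrez.esearch(db="bioproject", term=bioproject_accession)
    record = Entrez.read(handle)
    handle.close()
    if not record['IdList']:
        raise ValueError(f"No BioProject record found for {bioproject_accession}")
    bioproject_uid = record['IdList'][0]
    # Fetch the summary; the sample ID is parsed out of its Project_Title
    handle = Entrez.esummary(db="bioproject", id=bioproject_uid)
    summary_record = Entrez.read(handle)
    handle.close()
    project_title = summary_record['DocumentSummarySet']['DocumentSummary'][0]['Project_Title']
    # Look for HG-prefixed IDs first; fall back to NA-prefixed IDs if none match
    sample_ids = re.findall(r'\bHG\w+', project_title)
    if not sample_ids:
        sample_ids = re.findall(r'\bNA\w+', project_title)
    return sample_ids, summary_record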
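
The title-matching step inside get_bioproject_identifiers can be checked offline, without calling NCBI. This is a minimal sketch, assuming project titles embed a sample ID that starts with HG or NA; the helper name extract_sample_ids and the example titles are hypothetical and are not real NCBI titles.

```python
import re


def extract_sample_ids(project_title):
    """Return HG-prefixed IDs found in the title, falling back to NA-prefixed IDs."""
    ids = re.findall(r'\bHG\w+', project_title)
    if not ids:
        ids = re.findall(r'\bNA\w+', project_title)
    return ids


# Hypothetical titles, for illustration only
for title in [
    "Homo sapiens HG02257 genome assembly",
    "NA19240 assembly",
    "Umbrella project without a sample ID",
]:
    print(title, "->", extract_sample_ids(title))
```

These patterns match any word that starts with HG or NA, so an unrelated uppercase word would also be picked up. A stricter pattern such as r'\b(?:HG|NA)\d{5}\b' might be safer, but that assumes every ID has exactly five digits.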

In [14]:
# Path to your XML file
xml_file = "PRJNA730822.xml"
bioproject_umbrella_id = xml_file.split(".xml")[0]
# Parse the XML file
tree = ET.parse(xml_file)
root = tree.getroot()

In [15]:
bioproject_idref_dict = {
    bioproject_umbrella_id: [
        link.find('ProjectIDRef').get('accession')
        for link in root.findall('.//ProjectLinks/Link')
    ]
}
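
The link extraction above raises an AttributeError if any Link element lacks a ProjectIDRef child. A defensive variant can be tried on a small inline document first. This is only a sketch: SAMPLE_XML is a hypothetical, heavily trimmed stand-in for the efetch output (only the ProjectLinks/Link/ProjectIDRef path and the accession attribute come from the cell above), and linked_accessions is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the efetch XML; the real file has many more elements.
SAMPLE_XML = """
<RecordSet>
  <DocumentSummary>
    <ProjectLinks>
      <Link><ProjectIDRef accession="PRJNA727229"/></Link>
      <Link><ProjectIDRef accession="PRJNA727233"/></Link>
      <Link></Link>
    </ProjectLinks>
  </DocumentSummary>
</RecordSet>
"""


def linked_accessions(xml_root):
    """Collect accession attributes from ProjectLinks/Link/ProjectIDRef, skipping incomplete links."""
    accessions = []
    for link in xml_root.findall('.//ProjectLinks/Link'):
        ref = link.find('ProjectIDRef')
        if ref is not None and ref.get('accession'):
            accessions.append(ref.get('accession'))
    return accessions


sample_root = ET.fromstring(SAMPLE_XML)
print(linked_accessions(sample_root))
```

The same helper could be called on the root parsed from PRJNA730822.xml, assuming that file follows the structure the cell above relies on.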

In [18]:
# Map each component BioProject accession to its Coriell sample ID (filled in below)
bioproject_corriel_dict = {bioproj_assembly: None for bioproj_assembly in bioproject_idref_dict[bioproject_umbrella_id]}

print(bioproject_corriel_dict)

for bioproject_id in bioproject_idref_dict[bioproject_umbrella_id]:
    sample_ids, bioproject_summary_record = get_bioproject_identifiers(bioproject_id)
    if not sample_ids:
        # No HG/NA ID in the title: print it so the mapping can be set by hand
        print(bioproject_id, sample_ids, bioproject_summary_record['DocumentSummarySet']['DocumentSummary'][0]['Project_Title'])
    else:
        bioproject_corriel_dict[bioproject_id] = sample_ids[0]


{'PRJNA727229': None, 'PRJNA727233': None, 'PRJNA727234': None, 'PRJNA727235': None, 'PRJNA727236': None, 'PRJNA727241': None, 'PRJNA727242': None, 'PRJNA727244': None, 'PRJNA727245': None, 'PRJNA727246': None, 'PRJNA727248': None, 'PRJNA727250': None, 'PRJNA727251': None, 'PRJNA727252': None, 'PRJNA727254': None, 'PRJNA727255': None, 'PRJNA727256': None, 'PRJNA727257': None, 'PRJNA727258': None, 'PRJNA727259': None, 'PRJNA727260': None, 'PRJNA727262': None, 'PRJNA727263': None, 'PRJNA727265': None, 'PRJNA727266': None, 'PRJNA727267': None, 'PRJNA727268': None, 'PRJNA727269': None, 'PRJNA727270': None, 'PRJNA730822': None, 'PRJNA731866': None, 'PRJNA731869': None, 'PRJNA731875': None, 'PRJNA731877': None, 'PRJNA731878': None, 'PRJNA731883': None, 'PRJNA731884': None, 'PRJNA731885': None, 'PRJNA731886': None, 'PRJNA731888': None, 'PRJNA731889': None, 'PRJNA731891': None, 'PRJNA731892': None, 'PRJNA731893': None, 'PRJNA731895': None, 'PRJNA731896': None, 'PRJNA731901': None, 'PRJNA731902

In [21]:
bioproject_corriel_dict['PRJNA727270'] = 'HG00735' # This project is a component of the HPRC Assembly

In [23]:
bioproject_corriel_dict = {key: value for key, value in bioproject_corriel_dict.items() if value is not None}

In [55]:
bioproject_corriel_df = pd.DataFrame.from_dict(bioproject_corriel_dict,orient='index')
bioproject_corriel_ids = bioproject_corriel_df[0].tolist()

In [105]:
bioproject_corriel_df['BioProject-PRJNA730822-Umbrella'] = bioproject_corriel_df.index
bioproject_corriel_df.columns=['Corriel-ID','BioProject-PRJNA730822-Umbrella']

In [109]:
bioproject_corriel_df = bioproject_corriel_df.reset_index(drop=True)

In [113]:
bioproject_corriel_df.to_csv('ncbi-bioproject/PRJNA730822/PRJNA730822-BioProject-Umbrella-Corriel-IDs.tsv', sep='\t',index=False)

In [77]:
len(bioproject_corriel_ids)

47

In [118]:
hprc_production_df = pd.read_csv('production/hprc-production-biosample-table-20240409.tsv', sep='\t')

In [117]:
hprc_production_df.head()

    Sample     Accession  familyID  Subpopulation  Superpopulation  Production Year
0  HG01891  SAMN17861236      BB05            ACB              AFR              YR1
1  HG02486  SAMN17861238      BB55            ACB              AFR              YR1
2  HG02559  SAMN17861239      BB68            ACB              AFR              YR1
3  HG02257  SAMN17861237      BB21            ACB              AFR              YR1
4  HG01358  SAMN17861234     CLM31            CLM              AMR              YR1


In [138]:
collections.Counter(hprc_production_df['Production Year'].tolist())

Counter({'YR4': 101, 'YR3': 69, 'YR2': 52, 'YR1': 30})

In [153]:
# Samples from YR2-YR4 are the ones that will need new BioProjects
hprc_production_df[hprc_production_df['Production Year'].isin(['YR2', 'YR3', 'YR4'])].to_csv(
    'ncbi-bioproject/create-bioproject-identifiers/hprc-production-create-YR2_4-bioproject-20240409.tsv',
    sep='\t', index=False)

In [None]:
hprc_production_df[hprc_production_df['Production Year'].isin(['YR1'])].Sample.tolist()

In [119]:
hprc_production_assembly_df = hprc_production_df[hprc_production_df['Sample'].isin(bioproject_corriel_ids)]

In [147]:
# HG03471 is not HPRC Production
[sample for sample in hprc_production_df[hprc_production_df['Production Year'].isin(['YR1'])].Sample.tolist()
 if sample not in hprc_production_assembly_df['Sample'].tolist()]

['HG03471']

In [120]:
collections.Counter(hprc_production_assembly_df['Production Year'].tolist())

Counter({'YR1': 29})

In [122]:
hprc_production_assembly_df.shape

(29, 6)

In [123]:
hprc_plus_df = pd.read_csv('hprc-plus/hprc-plus-CorriellIDs.tsv',sep='\t')
hprc_plus_ids = hprc_plus_df['HPRC-plus-CorriellIDs'].tolist()

In [124]:
len(hprc_plus_ids)

23

In [129]:
bioproject_corriel_df.head()

   Corriel-ID  BioProject-PRJNA730822-Umbrella
0     HG02257                      PRJNA727229
1     HG02559                      PRJNA727233
2     HG02486                      PRJNA727234
3     HG01891                      PRJNA727235
4     HG03516                      PRJNA727236


In [133]:
# Coriell IDs that have a component BioProject but are absent from the production table
assembly_ids_missing_in_hprc_production = [
    sample for sample in bioproject_corriel_ids
    if sample not in hprc_production_assembly_df['Sample'].tolist()
]

In [134]:
len([sample for sample in assembly_ids_missing_in_hprc_production if sample in hprc_plus_ids])

18

In [135]:
assert len(assembly_ids_missing_in_hprc_production) + hprc_production_assembly_df.shape[0] == len(bioproject_corriel_ids)
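
The membership checks in the last few cells amount to splitting the umbrella's sample IDs into three groups: in the production table, in the HPRC-plus list, or in neither. A rough sketch of that bookkeeping with plain sets follows; partition_samples is a hypothetical helper and the IDs S1..S9 are toy placeholders, not real samples.

```python
def partition_samples(assembly_ids, production_ids, plus_ids):
    """Split assembly sample IDs by where else they appear."""
    assembly = set(assembly_ids)
    production = set(production_ids)
    plus = set(plus_ids)
    missing = assembly - production  # has a BioProject, not in the production table
    return {
        "in_production": sorted(assembly & production),
        "in_plus": sorted(missing & plus),
        "unaccounted": sorted(missing - plus),
    }


# Toy placeholder IDs, for illustration only
result = partition_samples(
    assembly_ids=["S1", "S2", "S3", "S4"],
    production_ids=["S1", "S2", "S9"],
    plus_ids=["S3"],
)
print(result)
```

With the notebook's variables, the three arguments would presumably be bioproject_corriel_ids, hprc_production_df['Sample'] and hprc_plus_ids. Sets drop duplicates, so the counts only line up with the list-based assert above if no sample ID repeats.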