**This notebook is used when adding new samples to the analyses done in the Scatterplot notebook. It checks which samples from the downloaded data have already been added and which have not, and it helps determine which papers are associated with the new samples.**

For now this notebook is geared toward loading the second batch of data. When the third and subsequent batches arrive, it will be modified as needed. An alternative is simply to combine all Proj_UID files from before the latest batch into Proj_UID.csv and run the same code with the required modifications.
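The alternative mentioned above, combining earlier Proj_UID files, could look roughly like the sketch below. It assumes every file uses the same `sample;bioproject` layout with one pair per line; the function name `combine_proj_uid` and the output file name in the usage note are hypothetical.

```python
def combine_proj_uid(paths, out_path):
    """Concatenate several 'sample;bioproject' files into one.

    Keeps the first entry seen for each sample id and returns the number
    of distinct samples written. out_path must not be one of the inputs.
    """
    seen = set()
    with open(out_path, "w") as out:
        for path in paths:
            with open(path, "r") as handle:
                for line in handle:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    sample = line.split(";")[0]
                    if sample in seen:
                        continue  # sample already listed in an earlier file
                    seen.add(sample)
                    out.write(line + "\n")
    return len(seen)
```

For example, `combine_proj_uid(["Proj_UID.csv", "Proj_UIDBatch2.csv"], "Proj_UIDCombined.csv")` would write the merged list to a new file, which can then be copied over Proj_UID.csv once it has been checked.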

#### Check for New Samples after Download

In [1]:
#The first batch of samples (original samples) were processed and matched to bioprojects using the ProjectMatch
#notebook in the Match folder.

#Proj_UID is a csv file with the sample ids and their associated bioproject identifier (beginning with PRJNA).
Orig=open("Proj_UID.csv","r")

#Here we simply load that first batch and get the sample ids. We save them in OrigSample.
OrigSample=[]
for line in Orig:
    OrigSample.append(line.strip("\n").split(";")[0])
    
Orig.close()

In [2]:
#Now we open the Mash distance tab file with all pairwise distances between all samples (regardless of the batch
#of data they are in).
New=open("NewDistances.tab","r", encoding="utf-8")

#From the file we get all sample ids which had a Mash sketch (and therefore have Mash distances with the other
#samples).
NewSet=[]
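for line in New:
    #We parse the file and get sample ids.
    try:
        sample=line.strip("\n").split("\t")[0].split("_")[0].split("/")[1]
    except IndexError:
        #Lines whose first field has no "/" (such as a header line) carry no sample id, so we skip them.
        continue
    #For now we save all sample ids in NewSet.
    if sample not in NewSet:
        NewSet.append(sample)

New.close()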
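To make the parsing step easier to follow, here is a minimal sketch of the same extraction as a standalone function. It assumes the first tab-separated field looks like `directory/SAMPLEID_suffix`; the path `sketches/SRR1_1.fastq.msh` in the usage note is a made-up illustration, not a name taken from the real data. The helper names `extract_sample_id` and `collect_sample_ids` are hypothetical.

```python
def extract_sample_id(line):
    """Return the sample id from one line of the distance table, or None.

    Mirrors the chained splits used above: take the first tab-separated
    field, keep the text before the first "_", then take the part after
    the first "/".
    """
    first_field = line.rstrip("\n").split("\t")[0]
    parts = first_field.split("_")[0].split("/")
    if len(parts) < 2:
        return None  # no "/" before the first "_", e.g. a header line
    return parts[1]


def collect_sample_ids(lines):
    """Collect unique sample ids in order of first appearance."""
    seen = set()      # set membership is faster than scanning a list
    ordered = []
    for line in lines:
        sample = extract_sample_id(line)
        if sample is not None and sample not in seen:
            seen.add(sample)
            ordered.append(sample)
    return ordered
```

With this, `extract_sample_id("sketches/SRR1_1.fastq.msh\t0.0\t0.1\n")` gives `"SRR1"`, and a line without a `/` in its first field gives `None` instead of raising an error.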

In [3]:
#Now we get the difference between NewSet and OrigSample. Sample ids present in the former but not the latter
#are new samples (they belong to the latest batch) and hence need to be processed before they can be analyzed.
NewSamples=list(set(NewSet)-set(OrigSample))
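A set difference does not guarantee any particular order. Since the later matching step relies on the order of the ids written to NewSampleIds.txt, a variant that keeps the order in which ids appear in the distance file may make reruns easier to compare. This is only a sketch of that idea; the function name `new_samples_in_order` is hypothetical.

```python
def new_samples_in_order(new_set, orig_samples):
    """Return ids from new_set that are absent from orig_samples.

    The result keeps the order of new_set, unlike a plain set difference.
    """
    known = set(orig_samples)
    return [sample for sample in new_set if sample not in known]
```

It could replace the set difference as `NewSamples=new_samples_in_order(NewSet,OrigSample)`.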

In [4]:
#We'll need to look up the bioproject associated with each new sample id using Entrez Direct. For that
#we'll need a list of the new sample ids. We generate a text file with those ids, NewSampleIds.txt, here.
NewS=open("NewSampleIds.txt","w")

for i in NewSamples:
    NewS.write(i+"\n")
NewS.close()

#### Associate bioprojects with new samples after Entrez download

In [1]:
#After using Entrez Direct to download the records of all samples, we can move on to associate the bioproject (which is
#within the record) with the samples. First we load the new sample ids (the ones from the latest batch).
NewS=open("NewSampleIds.txt","r")

NewSamples=[]
for line in NewS:
    NewSamples.append(line.strip("\n"))
NewS.close()

In [3]:
#The idea here is as follows: the run record does not have the sample id, but we downloaded the records in the same
#order as the samples are listed in NewSampleIds, so we simply need to match the bioproject identifier from the
#first record in NewSraRunInfo.csv (output from Entrez Direct) to the first sample, the second to the second, and so on.
NewProj=open("NewSraRunInfo.csv","r")

#We check the lines of NewSraRunInfo and extract the bioproject using a combination of "PRJNA" and commas as data
#delimiters. We save the ordered list of bioprojects in NProj.
NProj=[]
for line in NewProj:
    if "Run,ReleaseDate" in line:
        continue
    line=line.strip("\n").split("PRJNA")[1].split(",")[0]
    NProj.append("PRJNA"+line)
NewProj.close()
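Splitting on the text "PRJNA" fails for any record that lacks that prefix. One possible alternative, sketched below, reads the bioproject from a named column with the standard `csv` module. It assumes the header row starts with `Run` (as the check above implies) and contains a column called `BioProject`; that column name should be verified against the actual file. The function name `read_bioprojects` is hypothetical.

```python
import csv


def read_bioprojects(path, column="BioProject"):
    """Return the values of `column`, one per record, in file order.

    Header rows (first cell "Run") may repeat, for instance when several
    downloads were concatenated; each one resets the column position.
    """
    projects = []
    index = None
    with open(path, "r", newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # blank line
            if row[0] == "Run":
                index = row.index(column)  # ValueError if the column is missing
                continue
            if index is None:
                raise ValueError("data row found before any header row")
            # Keep an empty placeholder so positions still line up with the samples.
            projects.append(row[index] if index < len(row) else "")
    return projects
```

Keeping an empty string for records without a bioproject preserves the one-to-one correspondence with NewSamples, so a missing value shows up as a gap rather than shifting every later pair.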

In [4]:
#Finally we can generate a Proj_UID csv file for the new samples. We do that here using both NewSamples and
#NProj.
Out=open("Proj_UIDBatch2.csv","w")

for (ID,proj) in zip(NewSamples,NProj):
    Out.write(ID+";"+proj+"\n")
    
Out.close()

#It's a good idea to check the output file, as some bioprojects are not parsed correctly from NewSraRunInfo.csv (will try to
#correct that later).
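Part of that check can be automated. The sketch below flags two problems that `zip` would otherwise hide: a different number of samples and bioprojects (`zip` silently stops at the shorter list), and values that do not look like a bioproject accession. The pattern is an assumption: it accepts `PRJ` followed by two capital letters and digits, which covers PRJNA and also prefixes such as PRJEB and PRJDB that the "PRJNA" split would miss. The function name `check_pairs` is hypothetical.

```python
import re

# Assumed accession shape: PRJ + two capital letters + digits (e.g. PRJNA123456).
BIOPROJECT_RE = re.compile(r"^PRJ[A-Z]{2}\d+$")


def check_pairs(samples, projects):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if len(samples) != len(projects):
        problems.append(
            "count mismatch: %d samples vs %d bioprojects"
            % (len(samples), len(projects))
        )
    for sample, proj in zip(samples, projects):
        if not BIOPROJECT_RE.match(proj):
            problems.append("%s: unexpected bioproject %r" % (sample, proj))
    return problems
```

Running `check_pairs(NewSamples,NProj)` before writing Proj_UIDBatch2.csv and printing the result would show which entries need manual review.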