Note: This is a cleaner version of the data exploration notebook.

# Worldwide CF Data Analysis

We used Google BigQuery to identify all the CF metagenomes, and we analyse them here.

Initially, we compared the subsystem content, which showed that our metagenomes are similar to other CF metagenomes. However, it did not make a great figure.

We then developed a MinHash-based approach: we calculate all pairwise distances between samples and use them to demonstrate that the Adelaide CF dataset is no different from the other datasets.
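
As a rough illustration of the distance behind this approach, the sketch below converts the number of shared MinHash hashes into a Mash-style distance, D = -ln(2j / (1 + j)) / k, where j is the fraction of shared hashes. The function names are hypothetical, and the k-mer size of 21 and sketch size of 1000 are assumptions (the Mash defaults). They are consistent with the `Distance` and `kmers` values in the pairwise table later in this notebook, but they are not stated in the text.

```python
import math


def mash_distance(shared, sketch_size, k=21):
    """Mash-style distance from shared sketch hashes (k=21 is an assumed default)."""
    j = shared / sketch_size          # Jaccard estimate from the sketches
    if j == 0:
        return 1.0                    # no shared hashes: maximum distance
    # max() avoids returning -0.0 when the two sketches are identical
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


def distance_from_kmers(kmers, k=21):
    """Parse a 'shared/sketch_size' string such as '40/1000'."""
    shared, sketch_size = (int(x) for x in kmers.split("/"))
    return mash_distance(shared, sketch_size, k)


# Two 'kmers' values taken from the pairwise table in this notebook
d_40 = distance_from_kmers("40/1000")
d_68 = distance_from_kmers("68/1000")
print(round(d_40, 6), round(d_68, 6))
```

With these assumed parameters, the two values agree with the reported distances for the same rows to about six decimal places.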

We identify all the metagenomes available in the SRA.

First, go to [the NCBI BioProject pages for Cystic Fibrosis](https://www.ncbi.nlm.nih.gov/bioproject/?term=cystic+fibrosis) and choose *Send To:*, then *File*, and then *Accessions List*. That saves a list with one BioProject accession per line (_Note:_ Don't use the BioProject ID list. The list you want should have items starting with PRJEB).
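
A quick sanity check on the downloaded list can catch the wrong export format (numeric IDs rather than accessions) before anything is uploaded. This is only a sketch: `split_accessions` is a hypothetical helper, the accessions in the demo are made-up placeholders, and the pattern assumes accessions look like `PRJ` plus two letters plus digits, so it accepts PRJEB as well as other prefixes such as PRJNA.

```python
import re

# Assumed accession shape: PRJ + two uppercase letters + digits (e.g. PRJEB...)
BIOPROJECT_RE = re.compile(r"^PRJ[A-Z]{2}\d+$")


def split_accessions(lines):
    """Separate lines that look like BioProject accessions from everything else."""
    good, bad = [], []
    for line in lines:
        acc = line.strip()
        if not acc:
            continue  # ignore blank lines
        if BIOPROJECT_RE.match(acc):
            good.append(acc)
        else:
            bad.append(acc)
    return good, bad


# Placeholder values, not real accessions
good, bad = split_accessions(["PRJEB00001\n", "PRJNA000002\n", "123456\n", "\n"])
print(f"{len(good)} accessions, {len(bad)} unexpected lines: {bad}")
```

If `bad` is not empty, the file was probably saved with the ID list option rather than the accessions list.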

We upload that file to the Google Cloud console as a new table called `bioproject_accs`.

Next, we run this search.

```
create temp table AMPLICON(acc STRING) as select acc as amplicon from `nih-sra-datastore.sra.metadata` where assay_type = 'AMPLICON' or libraryselection = 'PCR';
# sra-searches.cystic_fibrosis.bioproject_accs is downloaded from NCBI BioProject https://www.ncbi.nlm.nih.gov/bioproject/?term=cystic+fibrosis and uploaded as a table
create temp table BIOPROJ(bioproject STRING) as SELECT string_field_0 FROM `sra-searches.cystic_fibrosis.bioproject_accs` WHERE string_field_0 IS NOT NULL;
select * from `nih-sra-datastore.sra.metadata` where acc not in (select acc from AMPLICON) and bioproject in (select bioproject from BIOPROJ) and (librarysource = "METAGENOMIC" or librarysource = 'METATRANSCRIPTOMIC' or organism like "%microbiom%" OR organism like "%metagenom%");
```

When I ran this (18/09/2024), 6,467 results were returned. I saved those as a newline-delimited JSON BigQuery results file (current version: [bq-results-20240918-082050-1726647675307.json.gz](bq-results-20240918-082050-1726647675307.json)).
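
A newline-delimited JSON export like this can be read straight into pandas. The sketch below is one way to do it, not necessarily how the rest of this notebook loads the file: `load_bq_results` is a hypothetical helper, and the demo records are made-up placeholders written to a temporary file so the example is self-contained.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def load_bq_results(path):
    """Read a newline-delimited JSON file (optionally gzipped) into a DataFrame."""
    # lines=True parses one JSON record per line;
    # compression="infer" handles both .json and .json.gz based on the extension
    return pd.read_json(path, lines=True, compression="infer")


# Placeholder records standing in for rows of the BigQuery export
records = [
    {"acc": "SRR0000001", "bioproject": "PRJEB00001", "librarysource": "METAGENOMIC"},
    {"acc": "SRR0000002", "bioproject": "PRJEB00001", "librarysource": "METATRANSCRIPTOMIC"},
]

with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.json.gz")
    with gzip.open(demo_path, "wt") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")
    demo = load_bq_results(demo_path)

print(demo.shape)
```

For the real file, the same call would take the path of the saved BigQuery results instead of `demo_path`.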



In [1]:
# load our libraries
import os
import sys

import re
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

from matplotlib.collections import PatchCollection
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

import pandas as pd
import seaborn as sns
import json
from natsort import natsorted

In [None]:
# If the data files are not under the current directory, fall back to the local checkout
if not os.path.exists(os.path.join("Adelaide", "subsystems", "level2_norm_ss.tsv.gz")):
    print("We are not in the right path. Trying to change!")
    os.chdir("/home/edwa0468/GitHubs/CF_Data_Analysis/WorldWideDataAnalysis")

adl = pd.read_csv("Adelaide/subsystems/level2_norm_ss.tsv.gz", compression='gzip', delimiter="\t")
adl = adl.set_index("Unnamed: 0")
adl = adl.reindex(natsorted(adl.columns), axis=1)
adl = adl.T
adl['geo_loc_name_country_calc'] = "Adelaide"
adl.head(3)

# Read the pairwise MinHash Distances for all the samples

For this dataset, we pre-filtered the SRA runs to remove any remaining runs that consist mostly of 16S sequences.

In [2]:
msh = pd.read_csv("OtherSequences/pairwise_mash_distances.no16s.tsv.gz", compression='gzip', delimiter="\t")
msh

| Unnamed: 0 | From | To | Distance | p-value | kmers |
|---|---|---|---|---|---|
| 0 | SRR10267760 | SRR10267760 | 0.000000 | 0.000000e+00 | 1000/1000 |
| 1 | SRR10267761 | SRR10267760 | 0.122140 | 9.205380e-140 | 40/1000 |
| 2 | SRR10267762 | SRR10267760 | 0.098138 | 9.371560e-257 | 68/1000 |
| 3 | SRR10267763 | SRR10267760 | 0.109331 | 2.501330e-204 | 53/1000 |
| 4 | SRR10267765 | SRR10267760 | 0.128270 | 2.206470e-136 | 35/1000 |
| ... | ... | ... | ... | ... | ... |
| 736159 | 895293_20180502_S | 983493_20180123_S | 0.098138 | 4.563390e-237 | 68/1000 |
| 736160 | 896213_20180427_S | 983493_20180123_S | 0.154223 | 9.761470e-69 | 20/1000 |
| 736161 | 913873_20180417_S | 983493_20180123_S | 0.178173 | 6.960310e-42 | 12/1000 |
| 736162 | 980574_20180403_S | 983493_20180123_S | 0.111073 | 7.693610e-173 | 51/1000 |
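
For clustering or ordination it is often handy to turn these long-format rows into a square sample-by-sample matrix. The following is a minimal sketch under stated assumptions, not necessarily how the rest of this analysis proceeds: `to_square` is a hypothetical helper, it assumes each pair may be listed in only one direction, and the demo rows are copied from the first three rows of the output above.

```python
import pandas as pd


def to_square(long_df):
    """Pivot From/To/Distance rows into a symmetric square distance matrix."""
    mat = long_df.pivot(index="From", columns="To", values="Distance")
    # Use the same sample order on both axes
    ids = sorted(set(mat.index) | set(mat.columns))
    mat = mat.reindex(index=ids, columns=ids)
    # Fill cells that were only reported in the opposite direction
    return mat.combine_first(mat.T)


demo_rows = pd.DataFrame(
    {
        "From": ["SRR10267760", "SRR10267761", "SRR10267762"],
        "To": ["SRR10267760", "SRR10267760", "SRR10267760"],
        "Distance": [0.000000, 0.122140, 0.098138],
    }
)
square = to_square(demo_rows)
print(square)
```

Pairs that are absent from the input stay as `NaN`, which makes any gaps in the pairwise table easy to spot.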
