### Scope Breast Cancer scRNAseq datasets from GEO
The aim of this notebook is to identify a suitable breast cancer scRNAseq dataset for analysis.

#### Load libraries

In [5]:
import numpy as np
import pandas as pd
import re

#### Import GEO query file

In [6]:
# Query in GEO: (("breast cancer") AND "single") AND "Homo sapiens"[porgn:__txid9606] 
# https://www.ncbi.nlm.nih.gov/
# Save the query results as a text file

# Load query file
with open('gds_result_single-cell.txt', 'r') as input_file:
    file_input = input_file.read()

# Split file per dataset (entries are separated by a blank line followed by an "N. " prefix)
geofile = re.split(r'\n\n\d+\. ', file_input)
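The split above relies on every record being preceded by a blank line and an `N. ` prefix, which may not hold for the very first record in the file. Below is a minimal sketch of a variant that treats the start of the text as a separator too. It runs on a made-up sample that only imitates the assumed layout of the export; `split_records`, the sample text, and the `GSE00000x` accessions are hypothetical.

```python
import re

# Made-up two-record sample imitating the assumed layout of the GEO text export
sample = (
    "\n1. Toy study A\n(Submitter supplied) First abstract.\n"
    "Organism:\tHomo sapiens\n"
    "Series\t\tAccession: GSE000001\tID: 1\n"
    "\n2. Toy study B\n(Submitter supplied) Second abstract.\n"
    "Organism:\tHomo sapiens\n"
    "Series\t\tAccession: GSE000002\tID: 2\n"
)


def split_records(text):
    """Split on an 'N. ' prefix at the start of the text or after a blank line."""
    chunks = re.split(r'(?:^|\n\n)\d+\. ', text.strip())
    # Drop the empty chunk produced by the match at the very start
    return [chunk.strip() for chunk in chunks if chunk.strip()]


records = split_records(sample)
titles = [record.partition('\n')[0] for record in records]
print(titles)
```

With this variant each record starts directly with its title line, so `partition('\n')[0]` returns the title for the first record as well.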

In [7]:
# Define a class to store the dataset information.
# Every attribute except geo_id is a one-element tuple, which lets
# pd.DataFrame.from_dict(dataset.__dict__) build a one-row DataFrame.
class geo_dataset:
    def __init__(self, title, description, organism, type, platform_and_n_samples, accession_num):
        self.title = (title,)
        self.description = (description,)
        self.organism = (organism,)
        self.data_type = (type,)
        self.platform_and_nsamples = (platform_and_n_samples,)
        self.geo_id = accession_num


# Create empty list to append the dataset information
datasets = []

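# Parse each entry into a dataset record
for entry in geofile:
    # Keep GSE series only; the 'Accession:' check avoids an IndexError on chunks without that field
    if 'Accession:' in entry and 'GSE' in entry.split('Accession:')[1]:
        # Abstract = text after '(Submitter supplied)' up to the end of that line, or 'N/A' if the marker is missing
        abstract = []
        if '(Submitter supplied)' in entry:
            abstract.append(entry.split('(Submitter supplied)')[1].split('\n')[0])
        else:
            abstract.append('N/A')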

        # Create list of datasets
        datasets.append(geo_dataset(title = entry.partition('\n')[0],
                                    description = abstract,
                                    organism = str(entry.split('\nOrganism:\t')[1].split('\n')[0].split('; ')),
                                    type = str(entry.split('\nType:\t\t')[1].split('\n')[0]),
                                    platform_and_n_samples = entry.split('\nType:\t\t')[1].split(':')[1].split('\n')[0],
                                    accession_num = re.findall(r'GSE\d*', entry)[2]))
        
# Transform into a dataframe
datasets_df = pd.concat([pd.DataFrame.from_dict(dataset.__dict__) for dataset in datasets])
datasets_df.head()

| | title | description | organism | data_type | platform_and_nsamples | geo_id |
|---|---|---|---|---|---|---|
| 0 | | [ We examine the transcriptional alterations i... | ['Homo sapiens'] | Expression profiling by high throughput sequen... | GPL18573 8 Samples | GSE229094 |
| 0 | Evaluation of breast cancer PDX tumor heteroge... | [ Breast cancer is the most commonly diagnosed... | ['Homo sapiens'] | Expression profiling by high throughput sequen... | GPL30173 GPL20301 25 Samples | GSE235168 |
| 0 | HSF1 excludes CD8+ T cells from breast tumors ... | [ Breast cancer cells or tumors underwent RNA-... | ['Homo sapiens'] | Expression profiling by high throughput sequen... | GPL24676 6 Samples | GSE236835 |
| 0 | Genome-wide CRISPR screen reveals tumor-intrin... | [ Radiation therapy (RT) is one of the most co... | ['Homo sapiens'] | Expression profiling by high throughput sequen... | GPL21697 12 Samples | GSE236331 |
| 0 | Droplet-based bisulfite sequencing for high-th... | [ We present a high-throughput and droplet-bas... | ['Homo sapiens', 'Mus musculus'] | Methylation profiling by high throughput seque... | GPL24676 GPL24247 GPL26363 8 Samples | GSE204691 |
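The chained `split` calls used to build this table raise an `IndexError` whenever a field is missing from an entry. As a hedged alternative, the sketch below pulls each field with a small regex helper that falls back to `'N/A'`, then builds the frame from a list of dicts. The entries are made up, and the `Platform:` label, the `field_after` helper, and the `GSE00000x` accessions are assumptions for illustration, not taken from the real export.

```python
import re
import pandas as pd


def field_after(entry, label, default='N/A'):
    """Return the rest of the line following `label`, or `default` if the label is absent."""
    match = re.search(re.escape(label) + r'[ \t]*(.*)', entry)
    return match.group(1).strip() if match else default


# Made-up entries imitating the assumed field labels; the second one has no abstract
entries = [
    "Toy study A\n(Submitter supplied) First abstract.\n"
    "Organism:\tHomo sapiens\nType:\t\tExpression profiling\n"
    "Platform: GPL00000 2 Samples\nSeries\t\tAccession: GSE000001\tID: 1",
    "Toy study B\nOrganism:\tMus musculus\nType:\t\tMethylation profiling\n"
    "Platform: GPL00000 3 Samples\nSeries\t\tAccession: GSE000002\tID: 2",
]

rows = []
for entry in entries:
    accession = re.search(r'Accession:\s*(GSE\d+)', entry)
    rows.append({
        'title': entry.partition('\n')[0],
        'description': field_after(entry, '(Submitter supplied)'),
        'organism': field_after(entry, 'Organism:'),
        'data_type': field_after(entry, 'Type:'),
        'platform_and_nsamples': field_after(entry, 'Platform:'),
        'geo_id': accession.group(1) if accession else 'N/A',
    })

# A list of dicts gives one row per dataset with a running index
toy_df = pd.DataFrame(rows)
print(toy_df)
```

Building the frame from plain dicts stores every field as a string and gives each row its own index, which may make later filtering simpler.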


In [8]:
# Save dataset
#datasets_df.to_csv('brca_datasets_df.csv', index = False)

In [9]:
# Filter out datasets from PDXs or cell lines
#exclusion_patterns = ['MCF', 'MDA-MB', 'CRISPR', 'PDX', 'xenograft']
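One way the exclusion list could be applied is a case-insensitive regex match over the title and description. The sketch below uses a made-up `toy_df` in place of `datasets_df`; the rows and `GSE00000x` accessions are hypothetical. The columns are cast with `astype(str)` because the `description` column in `datasets_df` appears to hold lists rather than plain strings.

```python
import re
import pandas as pd

# Pattern list copied from the commented-out line above
exclusion_patterns = ['MCF', 'MDA-MB', 'CRISPR', 'PDX', 'xenograft']

# Made-up rows standing in for datasets_df
toy_df = pd.DataFrame({
    'title': ['Primary tumour atlas', 'PDX tumor heterogeneity', 'Drug response in MCF7 cells'],
    'description': ['Patient biopsies', 'Mouse xenograft models', 'Cell line treated in vitro'],
    'geo_id': ['GSE000001', 'GSE000002', 'GSE000003'],
})

# Build one regex that matches any pattern; re.escape protects characters such as '-'
pattern = '|'.join(re.escape(p) for p in exclusion_patterns)

# Search title and description together, ignoring case
text = toy_df['title'].astype(str) + ' ' + toy_df['description'].astype(str)
is_excluded = text.str.contains(pattern, case=False, regex=True)

filtered_df = toy_df[~is_excluded].reset_index(drop=True)
print(filtered_df)
```

Substring matching is deliberately broad, so it can also drop patient datasets whose abstracts merely mention a cell line or CRISPR; the excluded rows are worth a manual check.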

In [None]:
# Interesting datasets: GSE180878, GSE225600, GSE190870 (not 10x), GSE167036 (is it 10x?), GSE198745 (only 2 samples), GSE123088 (too big),
# GSE180286, GSE158399, GSE176078 (good), GSE161529 (good!), GSE158724 (good!, snRNAseq)
# Reviewed up to GSE162726 [do not include]