# Overview 

This notebook describes how to fetch all necessary data from the GEO database using GEO query format. The workflow is designed to address the main challenges of multi-omics integration (metadata harmonization and matrix alignment).

Last update: 09/22/2023, Justin Zhang (justinxuan1230@gmail.com)


## Table of Contents
* [Initialization](#Initialization)
* [GSE_Crawler](#GSE_Crawler)
* [GSM_Crawler](#GSM_Crawler)
* [GEO_download](#GEO_download)
* [ENA_download](#ENA_download)
* [bulk_seurat/scanpy](#bulk_seurat/scanpy)


## Install necessary packages

In [None]:
!pip install scrapy --quiet
!pip install scrapydo --quiet
!pip install scrapy-user-agents --quiet

## Initialization

* Construct a GDS search query from any fields of interest
  - e.g. $\color{Yellow}{\text{"scRNA-seq"[All Fields] AND "lung"[All Fields] AND "Homo sapiens"[Organism] }}$
* Run the search at https://www.ncbi.nlm.nih.gov/gds
  - e.g. https://www.ncbi.nlm.nih.gov/gds/?term=%22scRNA-seq%22%5BAll+Fields%5D+AND+%22lung%22%5BAll+Fields%5D+AND+%22Homo+sapiens%22%5BOrganism%5D
* Save the resulting unique identifiers (UIDs) as a txt file
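
The three steps above can also be scripted. The following is a minimal sketch, not part of the original workflow: it builds the query string, reproduces the search URL shown above, and prepares a request to the NCBI E-utilities `esearch` endpoint for the `gds` database. The helper names (`build_gds_query`, `gds_search_url`, `esearch_url`, `fetch_uids`, `save_uids`) are hypothetical, and the one-UID-per-line file layout is an assumption about what the crawler scripts expect.

```python
import json
from urllib.parse import quote_plus, urlencode
from urllib.request import urlopen


def build_gds_query(keywords, organism):
    """Join free-text keywords and an organism into a GDS query string."""
    parts = [f'"{kw}"[All Fields]' for kw in keywords]
    parts.append(f'"{organism}"[Organism]')
    return " AND ".join(parts)


def gds_search_url(query):
    """URL of the GDS web search for the given query."""
    return "https://www.ncbi.nlm.nih.gov/gds/?term=" + quote_plus(query)


def esearch_url(query, retmax=500):
    """URL of an E-utilities esearch request against the gds database."""
    params = {"db": "gds", "term": query, "retmax": retmax, "retmode": "json"}
    return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + urlencode(params)


def fetch_uids(query, retmax=500):
    """Run the search over the network and return the list of UIDs."""
    with urlopen(esearch_url(query, retmax)) as resp:
        return json.load(resp)["esearchresult"]["idlist"]


def save_uids(uids, path):
    """Write one UID per line (assumed layout, not confirmed by the crawlers)."""
    with open(path, "w") as fh:
        fh.write("\n".join(uids) + "\n")


query = build_gds_query(["scRNA-seq", "lung"], "Homo sapiens")
print(query)
print(gds_search_url(query))
# Network call, left commented out on purpose:
# save_uids(fetch_uids(query), "gds_uids.txt")
```

Whether the UIDs returned by `esearch` match the identifier file that the crawlers read should be checked against a manually exported file before relying on this.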

## GSE_Crawler


* Crawl the GSE records, with their metadata filtered out
* We use this to extract the necessary ENA information and the corresponding GSM information

In [None]:
!python3 gse_crawler.py YOUR_GDS_IDENTIFIER

In [None]:
!python3 ena_crawler.py YOUR_GDS_IDENTIFIER

## GSM_Crawler

In [None]:
!python3 gsm_crawler.py YOUR_GDS_IDENTIFIER
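
The three crawler cells above can be chained from one Python cell. This is a small sketch under the assumption that the scripts are meant to run in the order shown (GSE, then ENA, then GSM) and that each takes the identifier as its only argument, as in the cells above. `crawler_commands` and `run_crawlers` are hypothetical helper names.

```python
import subprocess
import sys

# Script names taken from the cells above; the order is an assumption.
CRAWLERS = ["gse_crawler.py", "ena_crawler.py", "gsm_crawler.py"]


def crawler_commands(identifier, scripts=CRAWLERS, python=sys.executable):
    """Build one argument list per crawler script."""
    return [[python, script, identifier] for script in scripts]


def run_crawlers(identifier):
    """Run the crawlers one after another, stopping at the first failure."""
    for cmd in crawler_commands(identifier):
        # check=True raises CalledProcessError on a non-zero exit code
        subprocess.run(cmd, check=True)


# Print the commands without executing them.
for cmd in crawler_commands("YOUR_GDS_IDENTIFIER", python="python3"):
    print(" ".join(cmd))
```

Calling `run_crawlers("YOUR_GDS_IDENTIFIER")` would execute the scripts; stopping on the first failure avoids running a later crawler on incomplete output.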

## GEO_download

In [1]:
import pandas as pd
# Load the detail table for the series found by the search
df = pd.read_csv('/Users/justinzhang/Downloads/NCBI_Crawler/0714-ashtma-single-cell/gds_result-ashtma_single-cell_0714_detail_info.csv')
df.head(10)  # preview the first 10 rows


Unnamed: 0,bioproject,citation,contributors,date,download,download_url,gse_alias,organism,overall_design,summary,title
0,PRJNA946745,https://pubmed.ncbi.nlm.nih.gov/37184923,"Yang Y,Li H,Liu P,Zhang X,Wang Q,He H,Cui N,Ti...","Public on Mar 23, 2023",GSE227744_RAW.tar,https://www.ncbi.nlm.nih.gov/geo/download/?acc...,GSE227744,Homo sapiens,We collected bronchoalveolar lavage fluid (BAL...,We present a case of lethal neutrophilic asthm...,Increased mitochondrial metabolism in neutroph...
1,PRJNA798033,https://pubmed.ncbi.nlm.nih.gov/37146132,Smith NP,"Public on Feb 24, 2023",GSE193816_all_data_log_adjusted_counts.h5ad.gz...,ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE193nn...,GSE193816,Homo sapiens,scRNAseq of lung cells captured by endobronchi...,"Using a human model of asthma exacerbation, we...",A model of human asthma exacerbation identifie...
2,PRJNA507463,https://pubmed.ncbi.nlm.nih.gov/31358043,,"Public on Nov 23, 2021",GSE123088_RAW.tar,https://www.ncbi.nlm.nih.gov/geo/download/?acc...,GSE123088,Homo sapiens,Refer to individual Series,This SuperSeries is composed of the SubSeries ...,A validated single-cell-based strategy to iden...
3,PRJNA507464,https://pubmed.ncbi.nlm.nih.gov/31358043,"Gawel DR,Serra-Musach J,Aagesen J,Arenas A,Ask...","Public on Nov 23, 2021",GSE123086_RAW.tar,https://www.ncbi.nlm.nih.gov/geo/download/?acc...,GSE123086,Homo sapiens,Total RNA was extracted using the AllPrep DNA/...,We conducted prospective clinical studies to v...,A validated single-cell-based strategy to iden...
4,PRJNA743074,https://pubmed.ncbi.nlm.nih.gov/34389612,"Tibbitt C,Coquet J","Public on Jul 02, 2021",GSE179292_RAW.tar,https://www.ncbi.nlm.nih.gov/geo/download/?acc...,GSE179292,Homo sapiens,"Herein, we performed scRNA-Seq of CD4 T cells...",Chronic rhinosinusitis with nasal polyps (CRSw...,Single cell analysis of CD4 T cells and an ILC...
5,PRJNA672929,https://pubmed.ncbi.nlm.nih.gov/34006945,"Becker K,Klein H,Simon E,Viollet C,Haslinger C...","Public on Apr 19, 2021",GSE160308_human_retina_DR_smallRNA_counts.txt....,ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE160nn...,GSE160308,Homo sapiens,Post-mortem human eyes from 43 donors were obt...,Diabetic Retinopathy (DR) is among the major g...,In-depth transcriptomic analyses investigating...
6,PRJNA723562,https://pubmed.ncbi.nlm.nih.gov/33902571,Gomez J,"Public on Apr 22, 2021",GSE172495_RNA_Matrix_PBMC.csv.gz,ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE172nn...,GSE172495,Homo sapiens,PBMCs from five patients with severe asthma an...,Background: Asthma has been associated with im...,SINGLE-CELL CHARACTERIZATION OF A MODEL OF POL...
7,PRJNA672928,https://pubmed.ncbi.nlm.nih.gov/34006945,"Becker K,Klein H,Simon E,Viollet C,Haslinger C...","Public on Apr 19, 2021",GSE160306_human_retina_DR_totalRNA_counts.txt....,ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE160nn...,GSE160306,Homo sapiens,Post-mortem human eyes from 43 donors were obt...,Diabetic Retinopathy (DR) is among the major g...,In-depth transcriptomic analyses investigating...
8,PRJNA705698,https://pubmed.ncbi.nlm.nih.gov/33833438,"Zemmour D,Charbonnier L,Leon J,Chatila T,Mathi...","Public on Mar 02, 2021","GSE167976_RAW.tar,GSE167976_barcodes.tsv.gz,GS...",ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE167nn...,GSE167976,Homo sapiens,"Human CD4+T cells from IPEX, HD and mothers we...",Single-cell RNAseq (10x Genomics) analysis of ...,Single-cell RNAseq (10x Genomics) analysis of ...
9,PRJNA707555,https://pubmed.ncbi.nlm.nih.gov/33833438,,"Public on Mar 08, 2021",GSE168492_RAW.tar,https://www.ncbi.nlm.nih.gov/geo/download/?acc...,GSE168492,"Homo sapiens,Mus musculus",Refer to individual Series,This SuperSeries is composed of the SubSeries ...,Population and single cell RNAseq analysis of ...


In [2]:
df.download_url

0     https://www.ncbi.nlm.nih.gov/geo/download/?acc...
1     ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE193nn...
2     https://www.ncbi.nlm.nih.gov/geo/download/?acc...
3     https://www.ncbi.nlm.nih.gov/geo/download/?acc...
4     https://www.ncbi.nlm.nih.gov/geo/download/?acc...
5     ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE160nn...
6     ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE172nn...
7     ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE160nn...
8     ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE167nn...
9     https://www.ncbi.nlm.nih.gov/geo/download/?acc...
10    https://www.ncbi.nlm.nih.gov/geo/download/?acc...
11    ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE158nn...
12    https://www.ncbi.nlm.nih.gov/geo/download/?acc...
13    ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE145nn...
14    ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE146nn...
15                                                  NaN
16    https://www.ncbi.nlm.nih.gov/geo/download/?acc...
17    ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1
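
The `download_url` column mixes HTTPS and FTP links and contains missing values (row 15 is `NaN`), so it helps to normalize it before downloading. The sketch below is one way to do that, not the original download code. `list_download_targets` is a hypothetical helper, the example URLs are placeholders, and splitting a cell on commas is an assumption based on how the `download` column appears to store several file names.

```python
from urllib.parse import urlparse


def list_download_targets(rows):
    """Flatten (gse_alias, download_url) pairs into a list of targets.

    Non-string cells (NaN from pandas, None) are skipped.
    """
    targets = []
    for accession, cell in rows:
        if not isinstance(cell, str):
            continue  # missing URL: nothing to download
        for url in cell.split(","):  # assumption: comma-separated URLs
            url = url.strip()
            if url:
                targets.append(
                    {"gse": accession, "url": url, "scheme": urlparse(url).scheme}
                )
    return targets


# Toy rows with placeholder accessions and URLs.
rows = [
    ("GSE_A", "https://example.org/a.tar"),
    ("GSE_B", float("nan")),
    ("GSE_C", "ftp://example.org/c1.gz,ftp://example.org/c2.gz"),
]
targets = list_download_targets(rows)
for t in targets:
    print(t["gse"], t["scheme"], t["url"])
```

With the DataFrame loaded above, the input could be `zip(df.gse_alias, df.download_url)`. Each target can then be handed to whichever downloader suits its scheme.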

## ENA_download

1. **Install Aspera Connect**: Download and install the Aspera Connect client on your machine. This client provides the `ascp` command-line interface. For details, refer to this help page: <https://www.ebi.ac.uk/bioimage-archive/help-download/>

2. **Run the Command**: `bash ascrp_download.sh`
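
For reference, the sketch below shows how a single `ascp` call for an ENA FASTQ file is commonly assembled. The contents of `ascrp_download.sh` are not shown here, so this may differ from what the script does. The options (`-QT`, `-l`, `-P 33001`, `-i`) and the `era-fasp@` user follow the invocation commonly documented for ENA. The key path and the file path are placeholders that depend on your installation and data, and `ascp_command` is a hypothetical helper name.

```python
import shlex


def ascp_command(fastq_aspera, key_path, out_dir=".", rate="300m", port=33001):
    """Build an ascp argument list for one ENA file.

    fastq_aspera: a 'host:/path' string such as the ones in ENA file reports.
    key_path: the private key shipped with Aspera Connect (location varies).
    """
    return [
        "ascp", "-QT",
        "-l", rate,        # maximum transfer rate
        "-P", str(port),   # SSH port used for the transfer
        "-i", key_path,    # private key file
        f"era-fasp@{fastq_aspera}",
        out_dir,
    ]


cmd = ascp_command(
    "fasp.sra.ebi.ac.uk:/vol1/fastq/PLACEHOLDER/PLACEHOLDER.fastq.gz",
    key_path="/path/to/asperaweb_id_dsa.openssh",
    out_dir="fastq",
)
print(shlex.join(cmd))
```

Building the command as a list keeps paths with spaces intact, and it can be passed directly to `subprocess.run(cmd, check=True)`.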