# geofetch tutorial for processed data

The [GSE150868 data set](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE150868) has about 1.2 GB of processed data in 57 supplementary files, so it's a quick download for a test case. Let's take a quick peek at the geofetch version:

In [1]:
geofetch --version

geofetch 1.0.0


To see your CLI options, invoke `geofetch -h`:

In [2]:
geofetch -h

usage: geofetch [-h] [-V] -i INPUT [-n NAME] [-m METADATA_ROOT]
                [-u METADATA_FOLDER] [--just-metadata] [-r] [--acc-anno]
                [--use-key-subset] [-x] [--config-template CONFIG_TEMPLATE]
                [-k SKIP] [-p] [--stored-in {all,samples,series}]
                [--filter FILTER] [--filter-size FILTER_SIZE] [-g GEO_FOLDER]
                [-b BAM_FOLDER] [-f FQ_FOLDER] [-P PIPELINE_INTERFACES]
                [--silent] [--verbosity V] [--logdev]

Automatic GEO and SRA data downloader

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -i INPUT, --input INPUT
                        required: a GEO (GSE) accession, or a file with a list
                        of GSE numbers
  -n NAME, --name NAME  Specify a project name. Defaults to GSE number
  -m METADATA_ROOT, --metadata-root METADATA_ROOT
                        Specify a parent folder location to store metadat

Calling geofetch will do 4 tasks:

1. Download all or filtered processed files from `GSE#####` into your geo folder.
2. Download all metadata from GEO and store it in your metadata folder.
3. Produce PEP-compatible sample tables, `PROJECT_NAME_annotation_sample_processed.csv` and `PROJECT_NAME_annotation_series_processed.csv`, in your metadata folder.
4. Produce PEP-compatible project configuration files, `PROJECT_NAME_annotation_sample_processed.yaml` and `PROJECT_NAME_annotation_series_processed.yaml`, in your metadata folder.

Complete details about geofetch outputs are cataloged in the [metadata outputs reference](metadata_output.md).
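As a rough guide to what ends up in the metadata folder, the sketch below builds the file names to expect for a given project name and accession. The naming pattern is inferred from the `ls bright_test` listing later in this tutorial, not taken from geofetch's source, and `expected_metadata_files` is a hypothetical helper name.

```python
def expected_metadata_files(project_name, accession):
    """Build the metadata file names seen in this tutorial's listing.

    Assumption: the pattern is inferred from the tutorial output and
    may differ between geofetch versions.
    """
    # Files named after the GEO accession.
    names = [f"{accession}_{suffix}" for suffix in ("GSE.soft", "GSM.soft", "file_list.txt")]
    # PEP files named after the project: one table and one config per level.
    for level in ("sample", "series"):
        for ext in ("csv", "yaml"):
            names.append(f"{project_name}_annotation_{level}_processed.{ext}")
    return names


files = expected_metadata_files("bright_test", "GSE150868")
for name in files:
    print(name)
```

Comparing this list against the real folder contents is a quick way to confirm that a metadata-only run finished.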

## Download the data

First, create the metadata for the processed data by adding `--processed` and `--just-metadata`:

In [3]:
geofetch -i GSE150868 --processed -n bright_test --just-metadata

Metadata folder: /home/bnt4me/Virginia/for_docs/bright_test
Trying GSE150868 (not a file) as accession...
Skipped 0 accessions. Starting now.
Processing accession 1 of 1: 'GSE150868'
Found previous GSE file: /home/bnt4me/Virginia/for_docs/bright_test/GSE150868_GSE.soft
Found previous GSM file: /home/bnt4me/Virginia/for_docs/bright_test/GSE150868_GSM.soft
File /home/bnt4me/Virginia/for_docs/bright_test/GSE150868_file_list.txt exists.
Total number of processed SAMPLES files found is: 35
Total number of processed SERIES files found is: 2
Finished processing 1 accession(s)
Expanding metadata list...
Unifying and saving of metadata... 
File /home/bnt4me/Virginia/for_docs/bright_test/bright_test_annotation_sample_processed.csv has been saved successfully
  Config file: /home/bnt4me/Virginia/for_docs/bright_test/bright_test_annotation_sample_processed.yaml
Unifying and saving of metadata... 
File /home/bnt4me/Virginia/for_docs/bright_test/bright_tes

In [4]:
ls bright_test

bright_test_annotation_sample_processed.csv   GSE150868_file_list.txt
bright_test_annotation_sample_processed.yaml  GSE150868_GSE.soft
bright_test_annotation_series_processed.csv   GSE150868_GSM.soft
bright_test_annotation_series_processed.yaml


The `.soft` files are the direct output from GEO and contain all the metadata as stored by GEO, for both the experiment (`_GSE`) and the individual samples (`_GSM`). Geofetch can also produce a `csv` file with the SRA metadata, along with a filtered version (ending in `_filt`) that contains only the specified subset of samples; neither appears in this listing, which holds only processed-data metadata. Additionally, geofetch downloads `GSE150868_file_list.txt`, which contains the size, type, and creation date of all sample files.
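SOFT is a line-oriented text format: a line starting with `^` opens an entity such as a sample, and lines starting with `!` carry `key = value` attributes for it. The sketch below is a minimal reader, not geofetch's own parser. The excerpt is reconstructed from the sample titles and accessions shown later in this tutorial rather than copied from the real file, and `parse_soft_samples` is a hypothetical helper name.

```python
def parse_soft_samples(text):
    """Collect `!key = value` attributes under each `^SAMPLE = ...` entity."""
    samples = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("^"):
            kind, _, name = line[1:].partition(" = ")
            # Only track SAMPLE entities; ignore SERIES, PLATFORM, and others.
            current = samples.setdefault(name, {}) if kind == "SAMPLE" else None
        elif line.startswith("!") and current is not None:
            key, _, value = line[1:].partition(" = ")
            # An attribute can repeat within one sample, so keep a list per key.
            current.setdefault(key, []).append(value)
    return samples


# Illustrative excerpt (assumption: reconstructed, not the real file contents).
excerpt = """\
^SAMPLE = GSM4559921
!Sample_title = 11244
!Sample_geo_accession = GSM4559921
^SAMPLE = GSM4559922
!Sample_title = 22412
!Sample_geo_accession = GSM4559922
"""

samples = parse_soft_samples(excerpt)
print(samples["GSM4559921"]["Sample_title"])
```

To read the real file, pass the contents of `bright_test/GSE150868_GSM.soft` instead of the inline excerpt.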

Finally, there are the 2 files that make up each PEP: the `.yaml` config file and the `.csv` annotation file (one pair for samples and one for series). Let's see what's in these files now.

In [5]:
cat bright_test/bright_test_annotation_sample_processed.yaml

pep_version: 2.0.0
project_name: bright_test
sample_table: /home/bnt4me/Virginia/for_docs/bright_test/bright_test_annotation_sample_processed.csv

sample_modifiers:
  append:
    output_file_path: FILES
  derive:
    attributes: [output_file_path]
    sources:
      FILES: /{SRA}/{sample_name}




There are two important things to note in this file. First, `sample_table` points to the csv file produced by geofetch. Second, `output_file_path` is the location of all the files.
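To make the `sample_modifiers` section concrete, here is a simplified model of how the `append` and `derive` steps could resolve `output_file_path` for one sample. It is a sketch of the idea only; peppy's real implementation handles more cases. The sample values `SRA_VALUE` and `sample_a` are hypothetical placeholders, and `apply_modifiers` is a hypothetical helper name.

```python
def apply_modifiers(sample, append, derive):
    """Simplified model of the PEP `append` and `derive` sample modifiers."""
    # append: add constant attributes to the sample.
    sample = {**sample, **append}
    # derive: treat the attribute's value as a key into `sources`, then
    # fill the {placeholders} in that template from the sample's attributes.
    for attribute in derive["attributes"]:
        source_key = sample[attribute]
        template = derive["sources"][source_key]
        sample[attribute] = template.format(**sample)
    return sample


# Modifiers as they appear in the YAML file above.
append = {"output_file_path": "FILES"}
derive = {
    "attributes": ["output_file_path"],
    "sources": {"FILES": "/{SRA}/{sample_name}"},
}

# Hypothetical sample row; real values come from the csv sample table.
sample = {"sample_name": "sample_a", "SRA": "SRA_VALUE"}

resolved = apply_modifiers(sample, append, derive)
print(resolved["output_file_path"])  # /SRA_VALUE/sample_a
```

The template starts at `/` here because no geo folder was given in the metadata-only run.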

Now let's look at the first 100 characters of each line of the csv file:

In [8]:
cut -c -100 bright_test/bright_test_annotation_sample_processed.csv

GSE,Sample_title,Sample_geo_accession,Sample_status,Sample_submission_date,Sample_last_update_date,S
GSE150868,11244,GSM4559921,Public on Feb 06 2022,May 19 2020,Feb 07 2022,SRA,1,AML cells,Homo sapien
GSE150868,22412,GSM4559922,Public on Feb 06 2022,May 19 2020,Feb 07 2022,SRA,1,AML cells,Homo sapien
GSE150868,23207,GSM4559923,Public on Feb 06 2022,May 19 2020,Feb 07 2022,SRA,1,AML cells,Homo sapien
GSE150868,25963,GSM4559924,Public on Feb 06 2022,May 19 2020,Feb 07 2022,SRA,1,AML cells,Homo sapien
GSE150868,26774,GSM4559925,Public on Feb 06 2022,May 19 2020,Feb 07 2022,SRA,1,AML cells,Homo sapien
GSE150868,26996,GSM4559926,Public on Feb 06 2022,May 19 2020,Feb 07 2022,SRA,1,AML cells,Homo sapien
GSE150868,04H055,GSM4559927,Public on Feb 06 2022,May 19 2020,Feb 07 2022,SRA,1,AML cells,Homo sapie
GSE150868,04H112,GSM4559928,Public on Feb 06 2022,May 19 2020,Feb 07 2022,SRA,1,AML cells,Homo sapie
GSE150868,05H111,GSM4559929,Public on Feb 06 2022,May 19 2020,Feb 07 2022,SRA,1,AML cells,H

Now let's download the actual data. This time we will download data from the [GSE165049 data set](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE165049).
Let's also add a few arguments:
- `--geo-folder` (required): path to the location where the processed files will be saved
- `--filter`: a pattern that restricts the download to matching files; here, only _bed_ files (`--filter ".Bed.gz$"`)
- `--stored-in`: restricts the download to files stored at the sample level (`--stored-in samples`)
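The filter value reads like a regular expression applied to file names. The sketch below shows how such a pattern would select files, assuming case-insensitive matching; the run that follows suggests this, since `.Bed.gz$` picks up files ending in `.bed.gz`. The two `GSM0000000_…` names are hypothetical, and `select_files` is not part of geofetch.

```python
import re


def select_files(names, pattern):
    """Keep the names that match `pattern`, ignoring case (an assumption)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [name for name in names if regex.search(name)]


names = [
    "GSM5025274_CBS-u2-RH-H3K27me3-ChIP-seq_peaks.bed.gz",
    "GSM5025288_captureC_Catenin-peaks.bed.gz",
    "GSM0000000_hypothetical_signal.bw",  # hypothetical non-bed file
    "GSM0000000_hypothetical-bedxgz",     # hypothetical near-miss name
]

# An unescaped "." matches any character, so the loose pattern also
# accepts the near-miss name.
loose = select_files(names, ".Bed.gz$")

# Escaping the dots requires a literal ".bed.gz" suffix.
strict = select_files(names, r"\.bed\.gz$")

print(len(loose), len(strict))  # 3 2
```

For real GEO file names the two patterns usually select the same files, but the escaped form states the intent more precisely.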

In [10]:
geofetch -i GSE165049 --processed -n bright_test --filter ".Bed.gz$" --stored-in samples \
--geo-folder /home/bnt4me/Virginia/for_docs/geo

Metadata folder: /home/bnt4me/Virginia/for_docs/bright_test
Trying GSE165049 (not a file) as accession...
Skipped 0 accessions. Starting now.
Processing accession 1 of 1: 'GSE165049'
--2022-02-21 22:50:41--  https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=gse&acc=GSE165049&form=text&view=full
Resolving www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)... 2607:f220:41e:4290::110, 130.14.29.110
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|2607:f220:41e:4290::110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [geo/text]
Saving to: ‘/home/bnt4me/Virginia/for_docs/bright_test/GSE165049_GSE.soft’

/home/bnt4me/Virgin     [ <=>                ]   4.48K  --.-KB/s    in 0s      

2022-02-21 22:50:42 (49.6 MB/s) - ‘/home/bnt4me/Virginia/for_docs/bright_test/GSE165049_GSE.soft’ saved [4587]

--2022-02-21 22:50:42--  https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=gsm&acc=GSE165049&form=text&view=full
Resolving www.ncbi.nlm.nih.g

==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /geo/samples/GSM5025nnn/GSM5025274/suppl ... done.
==> SIZE GSM5025274_CBS-u2-RH-H3K27me3-ChIP-seq_peaks.bed.gz ... 629612
==> EPSV ... done.    ==> RETR GSM5025274_CBS-u2-RH-H3K27me3-ChIP-seq_peaks.bed.gz ... done.
Length: 629612 (615K) (unauthoritative)


2022-02-21 22:50:57 (3.52 MB/s) - ‘/home/bnt4me/Virginia/for_docs/geo/GSE165049/GSM5025274_CBS-u2-RH-H3K27me3-ChIP-seq_peaks.bed.gz’ saved [629612]

0
File /home/bnt4me/Virginia/for_docs/geo/GSE165049/GSM5025274_CBS-u2-RH-H3K27me3-ChIP-seq_peaks.bed.gz has been downloaded successfully
--2022-02-21 22:50:58--  ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM5025nnn/GSM5025275/suppl/GSM5025275_MYC-3CBS-RH-H3K27me3-ChIP-seq_peaks.bed.gz
           => ‘/home/bnt4me/Virginia/for_docs/geo/GSE165049/GSM5025275_MYC-3CBS-RH-H3K27me3-ChIP-seq_peaks.bed.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 2607:f220:41e:250::10, 

Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 2607:f220:41e:250::10, 2607:f220:41f:250::228, 130.14.250.11, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|2607:f220:41e:250::10|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /geo/samples/GSM5025nnn/GSM5025281/suppl ... done.
==> SIZE GSM5025281_HOTTIP-KO-SA1-ChIP-seq_peaks.bed.gz ... 309422
==> EPSV ... done.    ==> RETR GSM5025281_HOTTIP-KO-SA1-ChIP-seq_peaks.bed.gz ... done.
Length: 309422 (302K) (unauthoritative)


2022-02-21 22:51:07 (2.54 MB/s) - ‘/home/bnt4me/Virginia/for_docs/geo/GSE165049/GSM5025281_HOTTIP-KO-SA1-ChIP-seq_peaks.bed.gz’ saved [309422]

0
File /home/bnt4me/Virginia/for_docs/geo/GSE165049/GSM5025281_HOTTIP-KO-SA1-ChIP-seq_peaks.bed.gz has been downloaded successfully
--2022-02-21 22:51:07--  ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM5025nnn/GSM5025282/suppl/GSM502528

           => ‘/home/bnt4me/Virginia/for_docs/geo/GSE165049/GSM5025288_captureC_Catenin-peaks.bed.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 2607:f220:41e:250::10, 2607:f220:41f:250::228, 130.14.250.11, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|2607:f220:41e:250::10|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /geo/samples/GSM5025nnn/GSM5025288/suppl ... done.
==> SIZE GSM5025288_captureC_Catenin-peaks.bed.gz ... 19939
==> EPSV ... done.    ==> RETR GSM5025288_captureC_Catenin-peaks.bed.gz ... done.
Length: 19939 (19K) (unauthoritative)


2022-02-21 22:51:17 (573 KB/s) - ‘/home/bnt4me/Virginia/for_docs/geo/GSE165049/GSM5025288_captureC_Catenin-peaks.bed.gz’ saved [19939]

0
File /home/bnt4me/Virginia/for_docs/geo/GSE165049/GSM5025288_captureC_Catenin-peaks.bed.gz has been downloaded successfully
--2022-02-21 22:51:17--  ftp
--2022-02-21 22:51:25--  ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM5081nnn/GSM5081973/suppl/GSM5081973_WT-LK-ChIRP-seq_peaks.bed.gz
           => ‘/home/bnt4me/Virginia/for_docs/geo/GSE165049/GSM5081973_WT-LK-ChIRP-seq_peaks.bed.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 2607:f220:41e:250::11, 2607:f220:41e:250::13, 130.14.250.13, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|2607:f220:41e:250::11|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /geo/samples/GSM5081nnn/GSM5081973/suppl ... done.
==> SIZE GSM5081973_WT-LK-ChIRP-seq_peaks.bed.gz ... 42634
==> EPSV ... done.    ==> RETR GSM5081973_WT-LK-ChIRP-seq_peaks.bed.gz ... done.
Length: 42634 (42K) (unauthoritative)


2022-02-21 22:51:25 (896 KB/s) - ‘/home/bnt4me/Virginia/for_docs/geo/GSE165049/GSM5081973_WT-LK-ChIRP-seq_peaks.bed.gz’ saved [42634]

0
File /home/bnt4me/Virginia/fo

Now let's list the folder to see what data is there, and then see what's in the PEP files.

In [13]:
ls /home/bnt4me/Virginia/for_docs/geo/GSE165049

GSM5025270_CBS-u2-KO-H3K4me3-ChIP-seq_peaks.bed.gz
GSM5025271_CBS-u2-RH-H3K4me3-ChIP-seq_peaks.bed.gz
GSM5025272_MYC-3CBS-RH-H3K4me3-ChIP-seq_peaks.bed.gz
GSM5025273_CBS-u2-KO-H3K27me3-ChIP-seq_peaks.bed.gz
GSM5025274_CBS-u2-RH-H3K27me3-ChIP-seq_peaks.bed.gz
GSM5025275_MYC-3CBS-RH-H3K27me3-ChIP-seq_peaks.bed.gz
GSM5025276_WT-CTCF-ChIP-seq_peaks.bed.gz
GSM5025277_HOTTIP-KO-CTCF-ChIP-seq_peaks.bed.gz
GSM5025278_WT-RAD21-ChIP-seq_peaks.bed.gz
GSM5025279_HOTTIP-KO-RAD21-ChIP-seq_peaks.bed.gz
GSM5025280_WT-SA1-ChIP-seq_peaks.bed.gz
GSM5025281_HOTTIP-KO-SA1-ChIP-seq_peaks.bed.gz
GSM5025282_WT-SA2-ChIP-seq_peaks.bed.gz
GSM5025283_HOTTIP-KO-SA2-ChIP-seq_peaks.bed.gz
GSM5025284_captureC_HOXA-peaks.bed.gz
GSM5025285_captureC_HOTTIP-peaks.bed.gz
GSM5025286_captureC_MECOM-peaks.bed.gz

In [14]:
cut -c -100 bright_test/bright_test_annotation_sample_processed.csv

GSE,Sample_title,Sample_geo_accession,Sample_status,Sample_submission_date,Sample_last_update_date,S
GSE165049,CBS-u2-KO-H3K4me3-ChIP-seq,GSM5025270,Public on Feb 18 2022,Jan 18 2021,Feb 18 2022,SRA,1,
GSE165049,CBS-u2-RH-H3K4me3-ChIP-seq,GSM5025271,Public on Feb 18 2022,Jan 18 2021,Feb 18 2022,SRA,1,
GSE165049,MYC-3CBS-RH-H3K4me3-ChIP-seq,GSM5025272,Public on Feb 18 2022,Jan 18 2021,Feb 18 2022,SRA,
GSE165049,CBS-u2-KO-H3K27me3-ChIP-seq,GSM5025273,Public on Feb 18 2022,Jan 18 2021,Feb 18 2022,SRA,1
GSE165049,CBS-u2-RH-H3K27me3-ChIP-seq,GSM5025274,Public on Feb 18 2022,Jan 18 2021,Feb 18 2022,SRA,1
GSE165049,MYC-3CBS-RH-H3K27me3-ChIP-seq,GSM5025275,Public on Feb 18 2022,Jan 18 2021,Feb 18 2022,SRA
GSE165049,WT-CTCF-ChIP-seq,GSM5025276,Public on Feb 18 2022,Jan 18 2021,Feb 18 2022,SRA,1,MOLM13 leu
GSE165049,HOTTIP-KO-CTCF-ChIP-seq,GSM5025277,Public on Feb 18 2022,Jan 18 2021,Feb 18 2022,SRA,1,MOL
GSE165049,WT-RAD21-ChIP-seq,GSM5025278,Public on Feb 18 2022,Jan 18 2021,Feb 18 2022,SRA,1,

In [15]:
cat bright_test/bright_test_annotation_sample_processed.yaml

pep_version: 2.0.0
project_name: bright_test
sample_table: /home/bnt4me/Virginia/for_docs/bright_test/bright_test_annotation_sample_processed.csv

sample_modifiers:
  append:
    output_file_path: FILES
  derive:
    attributes: [output_file_path]
    sources:
      FILES: /home/bnt4me/Virginia/for_docs/geo/{SRA}/{sample_name}




Now we have easy access to this data for further analysis through the [peppy](http://peppy.databio.org/en/latest/) package in Python or [pepr](https://code.databio.org/pepr/) in R.
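Even without those packages, the sample table is an ordinary CSV file. The sketch below is a dependency-free stand-in, not the peppy API: it reads an inline excerpt with Python's standard `csv` module and picks out the CTCF samples. The excerpt keeps only the first six columns, because the output shown earlier is cut at 100 characters per line.

```python
import csv
import io

# First six columns of two rows from the sample table shown earlier
# (assumption: the remaining columns are omitted because the tutorial
# output is truncated).
excerpt = """\
GSE,Sample_title,Sample_geo_accession,Sample_status,Sample_submission_date,Sample_last_update_date
GSE165049,CBS-u2-KO-H3K4me3-ChIP-seq,GSM5025270,Public on Feb 18 2022,Jan 18 2021,Feb 18 2022
GSE165049,WT-CTCF-ChIP-seq,GSM5025276,Public on Feb 18 2022,Jan 18 2021,Feb 18 2022
"""

# Each row becomes a dict keyed by the header names.
rows = list(csv.DictReader(io.StringIO(excerpt)))

# Select the accessions of samples whose title mentions CTCF.
ctcf = [row["Sample_geo_accession"] for row in rows if "CTCF" in row["Sample_title"]]
print(ctcf)  # ['GSM5025276']
```

To work with the real table, replace the `io.StringIO(excerpt)` argument with an open file handle on `bright_test/bright_test_annotation_sample_processed.csv`.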