# geofetch tutorial for processed data

The [GSE185701 data set](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE185701) contains about 355 Mb of processed data in 57 supplementary files, so it downloads quickly and makes a good test case. First, let's check the geofetch version:

In [1]:
geofetch --version

geofetch 0.10.1


To see your CLI options, invoke `geofetch -h`.

Calling geofetch performs four tasks:

1. Download all processed files, or a filtered subset, from `GSE#####` into your geo folder.
2. Download all metadata from GEO and store it in your metadata folder.
3. Produce PEP-compatible sample tables, `PROJECT_NAME_sample_processed.csv` and `PROJECT_NAME_series_processed.csv`, in your metadata folder.
4. Produce PEP-compatible project configuration files, `PROJECT_NAME_sample_processed.yaml` and `PROJECT_NAME_series_processed.yaml`, in your metadata folder.

Complete details about geofetch outputs are cataloged in the [metadata outputs reference](metadata_output.md).

# SVG is exposed by the public IPython.display module
from IPython.display import SVG
SVG(filename='logo.svg')

![arguments_outputs.svg](attachment:arguments_outputs.svg)

## Download the data

First, create the metadata for the processed data by adding the `--processed` and `--just-metadata` flags:

In [3]:
geofetch -i GSE185701 --processed -n bright_test --just-metadata

Metadata folder: /home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/bright_test
Trying GSE185701 (not a file) as accession...
Skipped 0 accessions. Starting now.
Processing accession 1 of 1: 'GSE185701'
--2022-07-08 12:34:57--  https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=gse&acc=GSE185701&form=text&view=full
Resolving www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)... 2607:f220:41e:4290::110, 130.14.29.110
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|2607:f220:41e:4290::110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [geo/text]
Saving to: ‘/home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/bright_test/GSE185701_GSE.soft’

/home/bnt4me/Virgin     [ <=>                ]   2.82K  --.-KB/s    in 0s      

2022-07-08 12:34:57 (973 MB/s) - ‘/home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/bright_test/GSE185701_GSE.soft’ saved [2885]

--2022-07-08 12:34:57--  https://www.ncbi.nlm.nih.gov/geo/query/acc.c

In [4]:
ls bright_test

GSE185701_file_list.txt  GSE185701_GSE.soft  GSE185701_GSM.soft  PEP_samples


The `.soft` files are the direct output from GEO and contain all the metadata as stored by GEO, for both the experiment (`_GSE`) and the individual samples (`_GSM`). Geofetch also produces a `csv` file with the SRA metadata. The filtered version (ending in `_filt`) would contain only the specified subset of samples if we hadn't requested them all; in this case, since we gave only an accession, it is identical to the complete file. In addition, geofetch downloads `GSE185701_file_list.txt`, which records the size, type, and creation date of every sample file.
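As a rough illustration of what a `.soft` file holds, the Python sketch below parses SOFT-style text. It assumes the usual GEO layout, in which a line starting with `^` opens an entity and a line starting with `!` carries a `key = value` attribute. The excerpt is hand-written for illustration (its values come from the sample table in this tutorial), and `parse_soft` is a hypothetical helper, not part of geofetch.

```python
from collections import defaultdict

# Hand-written excerpt in SOFT style; NOT copied from GSE185701_GSM.soft.
SOFT_TEXT = """\
^SAMPLE = GSM5621756
!Sample_geo_accession = GSM5621756
!Sample_taxid_ch1 = 9606
!Sample_instrument_model = HiSeq X Ten
^SAMPLE = GSM5621758
!Sample_geo_accession = GSM5621758
!Sample_taxid_ch1 = 9606
"""


def parse_soft(text):
    """Group '!key = value' attributes under the preceding '^ENTITY = id' line."""
    entities = {}
    current = None
    for line in text.splitlines():
        if line.startswith("^"):
            # Entity header, e.g. '^SAMPLE = GSM5621756'
            _, _, entity_id = line.partition(" = ")
            current = entities.setdefault(entity_id.strip(), defaultdict(list))
        elif line.startswith("!") and current is not None:
            key, _, value = line[1:].partition(" = ")
            # A key may repeat within one entity, so keep every value in a list.
            current[key.strip()].append(value.strip())
    return entities


samples = parse_soft(SOFT_TEXT)
print(sorted(samples))
print(samples["GSM5621756"]["Sample_instrument_model"])
```

To try it on the real metadata, read `bright_test/GSE185701_GSM.soft` into a string and pass it to `parse_soft` in place of `SOFT_TEXT`.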

Finally, there are the two files that make up the PEP: the `GSE185701_samples.yaml` configuration file and the `GSE185701_samples.csv` sample table, both in the `PEP_samples` folder. Let's see what's in these files now.

In [6]:
cat bright_test/PEP_samples/GSE185701_samples.yaml

# Autogenerated by geofetch

pep_version: 2.1.0
project_name: GSE185701
sample_table: GSE185701_samples.csv

sample_modifiers:
  append:
    output_file_path: FILES
    sample_growth_protocol_ch1: Huh 7 was cultured in Dulbecco’s modified Eagle’s medium (DMEM) (Invitrogen, Carlsbad, CA, USA) containing 10% fetal bovine serum (FBS) (HyClone, Logan, UT, USA) and antibiotics (penicillin and streptomycin, Invitrogen) at 37 °C in 5% CO2.
    
  derive:
    attributes: [output_file_path]
    sources:
      FILES: /{gse}/{file}





There are a few important things to note in this file:

* First, `sample_table` points to the csv file produced by geofetch.
* Second, `output_file_path` gives the location of the downloaded files.
* Third, `sample_growth_protocol_ch1` is a constant sample attribute (identical for every sample) whose value is longer than 50 characters, so geofetch removes it from the csv file and stores it once under `sample_modifiers`. For large projects, this can significantly reduce the size of the metadata.
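To make the `derive` step concrete, here is a minimal Python sketch of how a source template such as `/{gse}/{file}` could be expanded for one sample. It assumes each row of the sample table carries `gse` and `file` attributes, as the template implies, and it uses plain `str.format`; peppy's real implementation may differ. `derive_path` is a hypothetical helper.

```python
# Source templates keyed by name, mirroring the `sources` section of the config.
SOURCES = {"FILES": "/{gse}/{file}"}


def derive_path(sample, attribute="output_file_path"):
    """Replace the source key stored in `attribute` with the expanded template."""
    template = SOURCES[sample[attribute]]
    # Fill the placeholders from the sample's own attributes.
    return {**sample, attribute: template.format(**sample)}


# Illustrative sample row; the `gse` and `file` column names are assumed.
sample = {
    "gse": "GSE185701",
    "file": "GSM5621756_ChIPseq_Huh7_siNC_H3K27ac_summits.bed.gz",
    "output_file_path": "FILES",  # value set by the `append` modifier
}

derived = derive_path(sample)
print(derived["output_file_path"])
```

The `append` modifier first gives every sample the same placeholder value (`FILES`), and `derive` then swaps that placeholder for a path built from the sample's own attributes.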

Now let's look at the first 100 characters of each line of the csv file:

In [7]:
cut -c -100 bright_test/PEP_samples/GSE185701_samples.csv

sample_taxid_ch1,sample_geo_accession,sample_channel_count,sample_instrument_model,biosample,supplem
9606,GSM5621756,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223730,wig files were gen
9606,GSM5621756,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223730,wig files were gen
9606,GSM5621758,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223732,wig files were gen
9606,GSM5621758,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223732,wig files were gen
9606,GSM5621760,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223728,wig files were gen
9606,GSM5621760,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223728,wig files were gen
9606,GSM5621761,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223729,wig files were gen
9606,GSM5621761,1,HiSeq X Ten,https://www.ncbi.nlm.nih.gov/biosample/SAMN22223729,wig files were gen
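Each accession appears on more than one row in this output, presumably one row per processed file (an assumption, since the truncated lines do not show the file column). The sketch below counts rows per accession with Python's standard `csv` module. It uses only the first four columns of the first four data rows so that it stays self-contained.

```python
import csv
import io
from collections import Counter

# First four columns of the table shown above; the real file has many more.
CSV_TEXT = """\
sample_taxid_ch1,sample_geo_accession,sample_channel_count,sample_instrument_model
9606,GSM5621756,1,HiSeq X Ten
9606,GSM5621756,1,HiSeq X Ten
9606,GSM5621758,1,HiSeq X Ten
9606,GSM5621758,1,HiSeq X Ten
"""

# For the real table, open the csv file instead of wrapping a string.
reader = csv.DictReader(io.StringIO(CSV_TEXT))
rows_per_sample = Counter(row["sample_geo_accession"] for row in reader)

for accession, n_rows in rows_per_sample.items():
    print(accession, n_rows)
```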


Now let's download the actual data. This time we will download the processed files from the [GSE185701 data set](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE185701).

Let's add a few more arguments:

* _geo-folder_ (required): path to the location where the processed files will be saved
* _filter_: download only _bed_ files (`--filter ".bed.gz$"`)
* _data-source_: download files only from the sample location (`--data-source samples`)
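The effect of the filter can be sketched in Python, assuming geofetch applies the pattern as a regular-expression search against each file name (the `$` anchor suggests this, but the details are an assumption). The first two file names come from this data set; the third is hypothetical.

```python
import re

# Pattern passed to --filter in this tutorial.
# Note: the unescaped "." matches any character, not only a literal dot.
FILTER = r".bed.gz$"

# The first two names are files from this data set; the third is hypothetical.
file_names = [
    "GSM5621756_ChIPseq_Huh7_siNC_H3K27ac_summits.bed.gz",
    "GSM5621760_CUTTag_Huh7_DHX37_summits.bed.gz",
    "GSM0000000_example_signal.wig.gz",
]

# Keep only the names that match the pattern.
kept = [name for name in file_names if re.search(FILTER, name)]
print(kept)

# A stricter variant that escapes the dots so they match literally.
STRICT_FILTER = r"\.bed\.gz$"
strict_kept = [name for name in file_names if re.search(STRICT_FILTER, name)]
print(strict_kept)
```

Both patterns select the same files here. The escaped form is the safer choice when file names might contain other characters where the dots are expected.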

In [8]:
geofetch -i GSE185701 --processed -n bright_test --filter ".bed.gz$" --data-source samples \
--geo-folder /home/bnt4me/Virginia/for_docs/geo

Metadata folder: /home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter
Trying GSE185701 (not a file) as accession...
Skipped 0 accessions. Starting now.
Processing accession 1 of 1: 'GSE185701'
--2022-07-08 12:36:16--  https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=gse&acc=GSE185701&form=text&view=full
Resolving www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)... 2607:f220:41e:4290::110, 130.14.29.110
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|2607:f220:41e:4290::110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [geo/text]
Saving to: ‘/home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/GSE185701_GSE.soft’

/home/bnt4me/Virgin     [ <=>                ]   2.82K  --.-KB/s    in 0s      

2022-07-08 12:36:16 (245 MB/s) - ‘/home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/GSE185701_GSE.soft’ saved [2885]

--2022-07-08 12:36:16--  https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=gsm&acc=GSE185701&form=text&

Now let's list the folder to see what data was downloaded, and then look at the PEP files again.

In [9]:
ls /home/bnt4me/Virginia/for_docs/geo/GSE185701

GSM5621756_ChIPseq_Huh7_siNC_H3K27ac_summits.bed.gz
GSM5621758_ChIPseq_Huh7_siDHX37_H3K27ac_summits.bed.gz
GSM5621760_CUTTag_Huh7_DHX37_summits.bed.gz
GSM5621761_CUTTag_Huh7_PLRG1_summits.bed.gz


In [14]:
cut -c -100 PEP_samples/GSE185701_samples.csv

sample_platform_id,sample_library_strategy,sample_contact_country,sample_contact_name,sample_contact
GPL20795,ChIP-Seq,China,"Xianghuo,,He",Shanghai,HCC,"transfected with siNC using Lipofectamine RNAiM
GPL20795,ChIP-Seq,China,"Xianghuo,,He",Shanghai,HCC,"transfected with siDHX37 using Lipofectamine RN
GPL20795,OTHER,China,"Xianghuo,,He",Shanghai,HCC,"transfected with Flag-DHX37 lentivirus, renew the 
GPL20795,OTHER,China,"Xianghuo,,He",Shanghai,HCC,untreated,SRA,Huh 7,hg38,Homo sapiens,HiSeq X Ten,h

In [15]:
cat PEP_samples/GSE185701_samples.yaml

# Autogenerated by geofetch

pep_version: 2.1.0
project_name: GSE185701
sample_table: GSE185701_samples.csv

sample_modifiers:
  append:
    output_file_path: FILES
    sample_growth_protocol_ch1: Huh 7 was cultured in Dulbecco’s modified Eagle’s medium (DMEM) (Invitrogen, Carlsbad, CA, USA) containing 10% fetal bovine serum (FBS) (HyClone, Logan, UT, USA) and antibiotics (penicillin and streptomycin, Invitrogen) at 37 °C in 5% CO2.
    
  derive:
    attributes: [output_file_path]
    sources:
      FILES: /home/bnt4me/Virginia/for_docs/geo/{gse}/{file}





Now this data is easy to access in further analysis with the [peppy](http://peppy.databio.org/en/latest/) package in Python or [pepr](https://code.databio.org/pepr/) in R.