# Goals
**[Script]** Select GEO studies from the initial batch based on favorable study traits:
- **Sufficiently powered (n>4 per group)**
- **Control/Normal/Healthy samples**
- **Sufficient sequencing depth (check spot counts, e.g. `run_total_spots`)**
- **Correct species, tissue/cell type, and data type**
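A minimal sketch of how the first three criteria could be checked programmatically, assuming a per-sample table with one row per run. The function name `select_studies`, the `group` column, the control labels, the placeholder study IDs, and the 10 million spot threshold are all illustrative assumptions, not values taken from this analysis. Species, tissue, and data type are left to manual review.

```python
import pandas as pd


def select_studies(samples, min_n=5, min_spots=10_000_000,
                   control_labels=("control", "normal", "healthy")):
    """Return study accessions that pass the power, control, and depth checks."""
    keep = []
    for study, grp in samples.groupby("study_accession"):
        group_sizes = grp["group"].value_counts()
        # n > 4 per group is the same as n >= 5 per group
        has_power = (group_sizes >= min_n).all()
        # At least one sample labelled as control/normal/healthy
        has_control = grp["group"].str.lower().isin(control_labels).any()
        # Median spot count as a rough proxy for sequencing depth (assumed cutoff)
        has_depth = grp["run_total_spots"].median() >= min_spots
        if has_power and has_control and has_depth:
            keep.append(study)
    return keep


# Hypothetical data; study IDs, groups, and spot counts are made up.
samples = pd.DataFrame({
    "study_accession": ["STUDY_A"] * 10 + ["STUDY_B"] * 8
                       + ["STUDY_C"] * 10 + ["STUDY_D"] * 10,
    "group": (["control"] * 5 + ["disease"] * 5      # A: passes all checks
              + ["control"] * 3 + ["disease"] * 5    # B: control group too small
              + ["disease"] * 5 + ["treated"] * 5    # C: no control group
              + ["control"] * 5 + ["disease"] * 5),  # D: too shallow
    "run_total_spots": [20_000_000] * 28 + [1_000_000] * 10,
})

selected = select_studies(samples)
print(selected)
```

Only `STUDY_A` passes here. The other three each fail exactly one criterion, which makes it easy to see which check rejected a study.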

**[Selected studies]:** GSE165322, GSE206529, [ERP104602], [GSE236566]
- **ERP104602:**

**[Note] Although this study is not in GEO, all of its metadata can still be obtained with `pysradb` and `ffq`**

# Packages

In [3]:
#########################
### Standard Library ####
#########################
import os
import re
import sys
import json
import math
import warnings
import subprocess
from glob import glob
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor

#####################
### Data Cleaning ###
#####################
import numpy as np
import pandas as pd
import janitor as jn
import VinlandPy as vp

###################
### Public Data ###
###################
import ffq
import GEOparse
from pysradb.sraweb import SRAweb

####################
### Session Info ###
####################
import session_info

## Options

In [None]:
warnings.simplefilter(action="once", category=Warning)

pd.options.display.max_columns = 200
pd.options.display.max_colwidth = 100
pd.options.display.max_rows = 200

## Functions

# Parameters

## Inputs

In [None]:
genome_build = "GRCh38"
release_version = "ensembl.111"

r1_suffix = ""  # _1.fastq.gz
r2_suffix = ""  # _2.fastq.gz

## Outputs

In [None]:
input_path = vp.createDir("./inputs")
# gse_path = vp.createDir(os.path.join(input_path, geo_id))

In [None]:
bio_project_id = "PRJEB22885"
sra_id = "ERP104602"

# Clean metadata
- Each study has a base set of columns to keep plus some additional columns we may want to add
- **Note:** Clean each study one at a time to ensure all desired metadata is collected
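The "base columns plus extras" idea can be sketched as a small helper that keeps the requested columns in order and reports any that are absent, so missing metadata is noticed per study. The helper name `keep_columns`, the toy table, and the extra column `tissue` are hypothetical.

```python
import pandas as pd


def keep_columns(df, base_cols, extra_cols=()):
    """Keep base + extra columns (in that order) and report missing ones."""
    # dict.fromkeys de-duplicates while preserving order
    wanted = list(dict.fromkeys([*base_cols, *extra_cols]))
    present = [c for c in wanted if c in df.columns]
    missing = [c for c in wanted if c not in df.columns]
    return df.loc[:, present].copy(), missing


# Hypothetical raw metadata for one study
raw = pd.DataFrame({
    "run_accession": ["RUN1", "RUN2"],
    "study_accession": ["STUDY1", "STUDY1"],
    "library_layout": ["PAIRED", "PAIRED"],
    "run_total_spots": [100, 200],
    "unused_field": ["x", "y"],
})

base_cols = ["study_accession", "run_accession", "library_layout",
             "run_total_spots", "run_total_bases"]

clean, missing = keep_columns(raw, base_cols, extra_cols=["tissue"])
print(clean.columns.tolist())
print("Missing:", missing)
```

Returning the missing names instead of raising lets each study be inspected one at a time, as the note above suggests.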

# Add metadata
- **Note:** `ffq` can fetch metadata and file links from GEO, SRA, EMBL-EBI (ENA), DDBJ, and ENCODE
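`ffq` reports its results as nested JSON, and the nesting depth depends on the accession type. A minimal sketch for pulling file links out of such a record, without relying on the exact nesting, is to walk the structure and collect every entry that carries a `url` key. The record below is made up for illustration and is not real `ffq` output. The key names `runs`, `files`, and `ftp` are assumptions about the layout, and `collect_file_entries` is a hypothetical helper.

```python
import json


def collect_file_entries(node):
    """Recursively gather every dict that has a "url" key."""
    found = []
    if isinstance(node, dict):
        if "url" in node:
            found.append(node)
        for value in node.values():
            found.extend(collect_file_entries(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_file_entries(item))
    return found


# Made-up record; accession names and URLs are placeholders.
example = json.loads("""
{
  "STUDY1": {
    "accession": "STUDY1",
    "runs": {
      "RUN1": {
        "files": {
          "ftp": [
            {"filename": "RUN1_1.fastq.gz", "url": "ftp://example.org/RUN1_1.fastq.gz"},
            {"filename": "RUN1_2.fastq.gz", "url": "ftp://example.org/RUN1_2.fastq.gz"}
          ]
        }
      }
    }
  }
}
""")

urls = [entry["url"] for entry in collect_file_entries(example)]
print(urls)
```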

### `pysradb`
- **Note:** GEO sample IDs (GSM accessions) do not have a dedicated column; when present, they are embedded in other fields and must be located by inspecting the table

In [None]:
db = SRAweb()
df_sra = db.srp_to_srx(sra_id)  # Same as sra_metadata()

sra_base_cols_to_keep = ["study_accession", "run_accession", "library_layout", "run_total_spots", "run_total_bases"]

vp.printDims(df_sra.reorder_columns(sra_base_cols_to_keep), showRows=(None, 1))
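Because GEO sample IDs have no dedicated column, one way to locate them is to scan every field of each row for a `GSM` accession pattern. The sketch below assumes the IDs, when present, look like `GSM` followed by digits. The toy table, its column contents, and the helper `find_gsm_ids` are hypothetical. For a study that is not in GEO, such as ERP104602, no matches would be expected.

```python
import re

import pandas as pd

GSM_PATTERN = re.compile(r"\bGSM\d+\b")


def find_gsm_ids(df):
    """Return, per row, the first GSM accession found in any column (or None)."""
    def first_gsm(row):
        for value in row:
            match = GSM_PATTERN.search(str(value))
            if match:
                return match.group(0)
        return None

    return df.apply(first_gsm, axis=1)


# Hypothetical metadata; accessions and titles are placeholders.
toy = pd.DataFrame({
    "run_accession": ["RUN1", "RUN2", "RUN3"],
    "experiment_alias": ["GSM0000001", "sample_b", "sample_c"],
    "experiment_title": ["liver rep1", "GSM0000002: liver rep2", "liver rep3"],
})

ids = find_gsm_ids(toy).tolist()
print(ids)
```

Scanning all columns is slower than reading a known field, but it avoids hard-coding where a given study happens to store its GEO IDs.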

# Session info

In [4]:
session_info.show(os=True, std_lib=False, dependencies=False)