# Series Finder  

## Find ordered samples based on real-valued variables in the Metadata

__Import dependencies and load data__

In [None]:
%load_ext rpy2.ipython

In [None]:
%%bash
wget https://cran.r-project.org/src/contrib/rjson_0.2.20.tar.gz
R CMD INSTALL rjson_0.2.20.tar.gz

In [None]:
%%R
library(rjson)

In [None]:
import json
import pandas as pd
from functions import *

experiment_to_terms_f_json = './data/experiment_to_terms.json'
term_name_to_id_f = './data/term_name_to_id.json'
experiments_in_hackathon_data_f = './data/experiments_in_hackathon_data.json'
experiment_to_type_f = './data/experiment_to_type.json'
experiment_to_study_f = './data/experiment_to_study.json'
experiment_to_real_value_terms_f = './data/experiment_to_real_value_terms.json'
experiment_to_runs_f = './data/experiment_to_runs.json'

with open(experiment_to_terms_f_json, 'r') as f:
    sample_to_terms = json.load(f)    
with open(term_name_to_id_f, 'r') as f:
    term_name_to_id = json.load(f)
with open(experiments_in_hackathon_data_f, 'r') as f:
    available = set(json.load(f))
with open(experiment_to_type_f, 'r') as f:
    sample_to_type = json.load(f)
with open(experiment_to_study_f, 'r') as f:
    sample_to_study = json.load(f)
with open(experiment_to_real_value_terms_f, 'r') as f:
    sample_to_real_val = json.load(f)
with open(experiment_to_runs_f, 'r') as f:
    sample_to_runs = json.load(f)

In [None]:
%%R
metadata_file_tsv <- read.table(file = "./data/experiment_to_terms.tsv", header = FALSE, sep = "\t")

__1. Enter your query__  

Enter your target term in place of `'blood'`, your target property in place of `'age'`, and your target unit in place of `None`.

(Note: most samples in the SRA do not have unit information. We advise leaving this as `None` for properties whose unit is implied; for example, age is usually expressed in years.)

In [None]:
term = 'blood' ## <-- INPUT HERE
target_property = 'age' ## <-- INPUT HERE
target_unit = None ## <-- INPUT HERE

__2. List terms below to remove__  
In the example below, `'disease'` and `'disease of cellular proliferation'` will be removed from all time points.

In [None]:
blacklist_terms = set([
    'disease', 
    'disease of cellular proliferation'
]) ## <-- INPUT HERE
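As a rough illustration of what removing terms means here, the sketch below drops blacklisted terms from each sample's term list. The helper `remove_blacklisted` and the toy sample IDs are hypothetical and exist only for this example; the actual filtering is done inside `series` and may work differently.

```python
def remove_blacklisted(sample_to_terms, blacklist_terms):
    """Return a copy of the mapping with blacklisted terms dropped from every sample."""
    return {
        sample: [t for t in terms if t not in blacklist_terms]
        for sample, terms in sample_to_terms.items()
    }

# Toy data for illustration only (not taken from the real metadata)
toy_sample_to_terms = {
    'EXP_A': ['blood', 'disease', 'adult'],
    'EXP_B': ['blood', 'disease of cellular proliferation'],
}
toy_blacklist = {'disease', 'disease of cellular proliferation'}

filtered = remove_blacklisted(toy_sample_to_terms, toy_blacklist)
print(filtered)
```

The toy variables use a `toy_` prefix so that running this cell does not overwrite the notebook's own `sample_to_terms` or `blacklist_terms`.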

__3. Search for ordered samples__

In [None]:
val_to_samples, primary_df = series(term, target_property, sample_to_real_val, sample_to_terms,             
        sample_to_type, sample_to_study, term_name_to_id, blacklist_terms, 
        filter_poor=False, filter_cell_line=True, filter_differentiated=True,
        value_limit=100, target_unit=target_unit)

These are the time points that were found:

In [None]:
df = pd.DataFrame(data=[(k,len(v)) for k,v in val_to_samples.items()], columns=[target_property, 'Number of samples'])
df.sort_values(target_property)
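The table above relies on `val_to_samples` mapping each value of the target property to the samples that carry it. A minimal sketch of that shape, using hypothetical toy data rather than the output of `series`, might look like this:

```python
from collections import defaultdict

# Toy input: sample ID -> value of the target property (illustrative only)
toy_sample_to_age = {'EXP_A': 38, 'EXP_B': 38, 'EXP_C': 52, 'EXP_D': 7}

# Group sample IDs by value, mirroring the value -> samples shape of val_to_samples
toy_val_to_samples = defaultdict(set)
for sample, value in toy_sample_to_age.items():
    toy_val_to_samples[value].add(sample)

# Count samples per value, ordered by value
toy_counts = sorted((value, len(samples)) for value, samples in toy_val_to_samples.items())
print(toy_counts)  # [(7, 1), (38, 2), (52, 1)]
```

How `series` actually builds its mapping (including the effect of `value_limit` and the filter flags) is not shown here and should be checked in `functions.py`.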

__4. Browse other metadata terms that are associated with samples in a given time point__

Set `view_value` to the value of the time point you would like to view (one of the values listed in the table above):

In [None]:
view_value = 38 ## <-- INPUT HERE

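# Fail early if the requested time point was not found in the search results
if view_value not in val_to_samples:
    raise KeyError("No samples found with %s=%s" % (target_property, view_value))
samples = list(val_to_samples[view_value])

# Write the selected samples to disk (presumably read by the R scripts sourced below)
with open('./data/term-in.json', 'w') as f:
    json.dump(samples, f)

print("Displaying data for %d sample(s) with %s=%s" % (len(samples), target_property, view_value))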

The following plot shows, for each metadata term that appears in at least 10% of the samples in the current subset, the proportion of samples annotated with that term:

In [None]:
%%R
source("./Metadata_plot.R")
bp
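The 10% cutoff described above can be sketched in Python as follows. This is not the logic of `Metadata_plot.R`, whose contents are not shown in this notebook; the function `term_proportions`, its `min_fraction` parameter, and the toy data are assumptions made for illustration.

```python
from collections import Counter

def term_proportions(samples, sample_to_terms, min_fraction=0.1):
    """Fraction of samples carrying each term, keeping terms at or above min_fraction."""
    if not samples:
        return {}
    counts = Counter()
    for sample in samples:
        # set() so a term repeated within one sample is counted once
        counts.update(set(sample_to_terms.get(sample, [])))
    n = len(samples)
    return {term: c / n for term, c in counts.items() if c / n >= min_fraction}

# Toy data for illustration only
toy_terms = {
    'EXP_A': ['blood', 'adult'],
    'EXP_B': ['blood'],
    'EXP_C': ['blood', 'female'],
    'EXP_D': ['blood', 'adult'],
}

# A higher threshold than 10% is used here so that the small toy set shows a term being dropped
props = term_proportions(list(toy_terms), toy_terms, min_fraction=0.5)
print(props)
```

With the notebook's own data, a call such as `term_proportions(samples, sample_to_terms)` would give a comparable summary, assuming `sample_to_terms` maps each sample to a list of term names.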

In [None]:
%%R
source("./Metadata_table.R")

In [None]:
%%R
source("./Metadata_piecharts.R")

__5. Produce output file__

Enter the name of the file to which you would like to write these samples:

In [None]:
output_file = 'series_data.csv' ## <- OUTPUT FILE HERE

primary_df.to_csv(output_file)