# Series Finder  

## Find ordered samples based on real-valued variables in the metadata

__Import dependencies and load data__

In [2]:
import json
import pandas as pd
import matplotlib.pyplot as plt  # used below for plt.savefig / plt.show
from utils import *

__1. Do you have a set of samples that you would like to restrict the retrieval to?__ 

These may be SRA samples that you have preprocessed and/or for which you have access to expression data.

In [3]:
available_data_f = None  ## <-- INPUT HERE

r = load_metadata(available_data_f)
sample_to_terms = r[0]
term_name_to_id = r[1]
sample_to_type = r[2]
sample_to_study = r[3]
sample_to_runs = r[4]
sample_to_real_val = r[5]

__2. Enter your query__  

Enter your target term in place of `'brain'`, your target property in place of `'age'`, and your target unit in place of `'year'`.

(Note: most samples in the SRA do not have unit information. We advise setting the unit to `None` for properties in which the unit is implied; for example, age is usually expressed in years.)

In [4]:
term = 'brain' ## <-- INPUT HERE
target_property = 'age' ## <-- INPUT HERE
target_unit = 'year' ## <-- INPUT HERE

__3. List terms below to remove__  

In the example below, the terms `'disease'` and `'disease of cellular proliferation'` will be removed from all timepoints.

In [5]:
blacklist_terms = set([
    'disease', 
    'disease of cellular proliferation'
]) ## <-- INPUT HERE
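
Term names that are misspelled are easy to miss, so it can help to check the query term and the blacklist against the loaded vocabulary before searching. The sketch below assumes that `term_name_to_id` is a dictionary keyed by term name, which is suggested by its name but not confirmed here. The helper `find_unknown_terms`, the toy dictionary, and its placeholder IDs are hypothetical and are not part of `utils`.

```python
def find_unknown_terms(term_names, term_name_to_id):
    """Return the names in `term_names` that are not keys of `term_name_to_id`."""
    return sorted(set(term_names) - set(term_name_to_id))


# Toy stand-in for the real mapping; the IDs are placeholders, not real ontology IDs.
toy_term_name_to_id = {
    "brain": "TOY:0001",
    "disease": "TOY:0002",
    "disease of cellular proliferation": "TOY:0003",
}

# The last name contains a deliberate typo.
toy_names = ["brain", "disease", "diseas of cellular proliferation"]

unknown_terms = find_unknown_terms(toy_names, toy_term_name_to_id)
print(unknown_terms)
```

In the notebook, the same check could be applied to `blacklist_terms | {term}` and the real `term_name_to_id`, provided the assumption about its structure holds.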

__4. Search for ordered samples__

In [7]:
SAVE_FIGURE = 'brain_series.png' ## <-- INPUT

# Pass the unit chosen in step 2 rather than a hard-coded None
val_to_samples, primary_df = series(
    term, target_property, sample_to_real_val, sample_to_terms,
    sample_to_type, sample_to_study, term_name_to_id, blacklist_terms,
    filter_poor=False, filter_cell_line=True, filter_differentiated=True,
    value_limit=100, target_unit=target_unit)
create_series_plots(val_to_samples, target_property)
if SAVE_FIGURE is not None:
    plt.savefig(SAVE_FIGURE, format='png', dpi=150)
plt.show()

TypeError: list indices must be integers or slices, not str

__5. Browse other metadata terms that are associated with samples in a given time point__

Set `view_value` to the value of the timepoint you would like to view:

In [11]:
view_value = 38 ## <-- INPUT HERE

if view_value in val_to_samples:
    samples = list(val_to_samples[view_value])
    with open('./data/term-in.json', 'w') as f:
        json.dump(samples, f)
    print("Displaying data for %d sample with %s=%d" % (len(samples), target_property, view_value))
else:
    print("Value {} was not found in the longitudinal query. Please try another query.".format(view_value))

Displaying data for 15 sample with age=38
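
To pick a valid `view_value`, it can help to list which values are present and how many samples each one has. The sketch below assumes that `val_to_samples` behaves like a dictionary mapping each value to a collection of sample IDs, which is how the cell above uses it. The helper `summarize_values` and the toy data are hypothetical.

```python
def summarize_values(val_to_samples):
    """Return (value, number of samples) pairs, ordered by value."""
    return [
        (value, len(samples))
        for value, samples in sorted(val_to_samples.items(), key=lambda kv: kv[0])
    ]


# Toy stand-in for the mapping returned by the search step.
toy_val_to_samples = {
    38: {"S1", "S2"},
    2: {"S3"},
    65: {"S4", "S5", "S6"},
}

value_summary = summarize_values(toy_val_to_samples)
for value, n_samples in value_summary:
    print("%s: %d sample(s)" % (value, n_samples))
```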


The following plot shows the proportion of samples annotated with each metadata term, restricted to terms that appear in at least 10% of the samples in the current subset:
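
As a rough sketch of the computation behind such a plot, the example below counts how many samples carry each term and keeps the terms that reach the 10% threshold. It assumes that `sample_to_terms` maps a sample ID to an iterable of term names. The helper `term_proportions` and the toy data are hypothetical and may differ from what `utils` actually does.

```python
from collections import Counter


def term_proportions(sample_ids, sample_to_terms, min_fraction=0.10):
    """Map each term to the fraction of samples annotated with it,
    keeping only terms at or above `min_fraction`."""
    n_samples = len(sample_ids)
    if n_samples == 0:
        return {}
    counts = Counter()
    for sample_id in sample_ids:
        # Use a set so a term repeated within one sample is counted once.
        counts.update(set(sample_to_terms.get(sample_id, ())))
    return {
        term_name: count / n_samples
        for term_name, count in counts.most_common()
        if count / n_samples >= min_fraction
    }


# Toy data: 20 samples, all 'brain', 5 'male', 1 with a rare term.
toy_samples = ["S%02d" % i for i in range(20)]
toy_sample_to_terms = {s: ["brain"] for s in toy_samples}
for s in toy_samples[:5]:
    toy_sample_to_terms[s].append("male")
toy_sample_to_terms["S00"].append("rare term")

proportions = term_proportions(toy_samples, toy_sample_to_terms)
for term_name, fraction in proportions.items():
    print("%s: %.2f" % (term_name, fraction))
```

The resulting dictionary can be passed to any bar-chart routine; the rare term is dropped because it appears in only 5% of the toy samples.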

__6. Produce output file__

Enter the name of the file to which you would like to write these samples:

In [15]:
SAVE_FILE = 'series_data.tsv' ## <- OUTPUT FILE HERE

primary_df.to_csv(SAVE_FILE, sep='\t')
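
A file written this way can be read back with `pandas.read_csv` using the same separator. The sketch below round-trips a small toy table through an in-memory buffer instead of a file on disk. The column names and sample IDs are made up for illustration, since the actual columns of `primary_df` are not shown here.

```python
import io

import pandas as pd

# Toy stand-in for primary_df; the columns and index values are hypothetical.
toy_df = pd.DataFrame(
    {"age": [38, 40], "study": ["STUDY_A", "STUDY_B"]},
    index=["SAMPLE_1", "SAMPLE_2"],
)

# Write tab-separated text, as to_csv(SAVE_FILE, sep='\t') would.
buffer = io.StringIO()
toy_df.to_csv(buffer, sep="\t")
buffer.seek(0)

# index_col=0 restores the index that to_csv wrote as the first column.
reloaded = pd.read_csv(buffer, sep="\t", index_col=0)
print(reloaded)
```

For the real output, replacing `buffer` with the path stored in `SAVE_FILE` should work the same way.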