# Objectives

- Identify disease biomarkers.

- Build an inductive model.

# Dataset Selection

- Datasets: 191

- Samples: 29,198

### OpenRefine

- Retrieve _Technology_ data from Gene Expression Omnibus (https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/geo/browse/?view=platforms):

  - high-throughput sequencing (64 datasets: 1473 samples)

  - spotted oligonucleotide (35 datasets: 24,119 samples)

  - in situ oligonucleotide (61 datasets: 2987 samples)

  - other (3 datasets: 75 samples)

  - RT-PCR (28 datasets: 544 samples)

- Delete datasets:

  - _Technology_ = other || RT-PCR (https://biosistemika.com/blog/qpcr-microarrays-rna-sequencing-choose-one/)
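The selection above was done in OpenRefine. As a rough pandas equivalent, the sketch below filters a dataset table by technology and summarizes what remains. The table, its column names (`dataset_id`, `technology`, `samples`), and its rows are hypothetical and only illustrate the step. By the counts listed above, the deletion leaves 160 datasets and 28,579 samples.

```python
import pandas as pd

# Hypothetical dataset table; column names and rows are illustrative only.
datasets = pd.DataFrame({
    "dataset_id": ["D1", "D2", "D3", "D4", "D5"],
    "technology": [
        "high-throughput sequencing",
        "spotted oligonucleotide",
        "in situ oligonucleotide",
        "other",
        "RT-PCR",
    ],
    "samples": [10, 20, 30, 5, 8],
})

# Technologies excluded in the list above.
excluded = {"other", "RT-PCR"}

# Keep only datasets whose technology is not excluded.
kept = datasets[~datasets["technology"].isin(excluded)]

# Per-technology summary: number of datasets and total number of samples.
summary = kept.groupby("technology").agg(
    datasets=("dataset_id", "count"),
    samples=("samples", "sum"),
)
print(summary)
```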

# Metadata Preprocessing

- Features: 267

### OpenRefine

- Add dataset-level features.

- Delete samples:

   - _organism_ ≠ Homo sapiens (22 samples)

   - _channel_count_ = 2 (189 samples)

- Split and transpose _key:value_ columns.
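The OpenRefine steps above could be approximated in pandas as follows. This is a minimal sketch on made-up data: the column names `characteristics_1` and `characteristics_2`, the sample IDs, and the values are assumptions, and the real export may be laid out differently.

```python
import pandas as pd

# Hypothetical sample table; names and values are illustrative only.
samples = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "organism": ["Homo sapiens", "Homo sapiens", "Mus musculus", "Homo sapiens"],
    "channel_count": ["1", "1", "1", "2"],
    "characteristics_1": ["age: 65", "gender: male", "age: 3", "age: 50"],
    "characteristics_2": ["gender: female", "age: 70", None, None],
})

# Keep single-channel human samples only.
mask = (samples["organism"] == "Homo sapiens") & (samples["channel_count"] != "2")
samples = samples[mask]

# Reshape the key:value columns to long form, one row per (sample, pair).
kv_columns = ["characteristics_1", "characteristics_2"]
long_form = samples.melt(
    id_vars="sample_id", value_vars=kv_columns, value_name="pair"
).dropna(subset=["pair"])

# Split each pair at the first colon into a key and a value.
long_form[["key", "value"]] = long_form["pair"].str.split(":", n=1, expand=True)
long_form["key"] = long_form["key"].str.strip()
long_form["value"] = long_form["value"].str.strip()

# One column per key; 'first' resolves repeated keys within a sample.
wide = long_form.pivot_table(
    index="sample_id", columns="key", values="value", aggfunc="first"
)
print(wide)
```

Splitting at the first colon only keeps values that themselves contain a colon intact.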

In [14]:
import pandas as pd


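def keep_first_occurrence(row):
    """Blank out values that already appeared earlier in the same row."""
    new_row = []
    for value in row:
        # Repeated values become None; the first occurrence stays in place
        new_row.append(value if value not in new_row else None)
    return pd.Series(new_row, index=row.index)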


metadata = pd.read_csv('data/PD.csv', dtype=str)

# Keep only the first occurrence of a value for each row
metadata = metadata.apply(keep_first_occurrence, axis=1)

# Remove constant columns (a single distinct value; missing values are ignored by nunique)
metadata.drop(columns=metadata.columns[metadata.nunique() == 1], inplace=True)

# Remove empty columns
metadata.dropna(axis=1, how='all', inplace=True)

# Save metadata
metadata.to_csv('data/PD.csv', index=False)

### OpenRefine

- Delete irrelevant or redundant features.

- Delete features where missing values ⪆ 98%.

- Merge equivalent features associated with different datasets.

- Rename features in a consistent way.

- Standardize _age_ and _sex_ values.

- Update dataset-level features using sample-level data (*sample_type*, *disease*).
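These cleaning steps were performed in OpenRefine. The sketch below shows how three of them (dropping sparse features, merging equivalent features, standardizing _age_ and _sex_) might look in pandas. The toy table, the `gender` and `note` columns, the value spellings, and the mapping are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical metadata; names, values, and mappings are illustrative only.
metadata = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "sex": ["M", None, "Male", None],
    "gender": [None, "female", None, "f"],
    "age": ["65", "70 years", "58y", None],
    "note": [None, None, None, None],
})

# Drop features whose fraction of missing values reaches the threshold.
threshold = 0.98
missing_fraction = metadata.isna().mean()
metadata = metadata.loc[:, missing_fraction < threshold].copy()

# Merge two equivalent features: fill gaps in 'sex' from 'gender'.
metadata["sex"] = metadata["sex"].combine_first(metadata["gender"])
metadata = metadata.drop(columns="gender")

# Map spelling variants to one vocabulary; unknown variants become NaN.
sex_map = {"m": "male", "male": "male", "f": "female", "female": "female"}
metadata["sex"] = metadata["sex"].str.strip().str.lower().map(sex_map)

# Extract the first number from each age string and convert it to numeric.
age_number = metadata["age"].str.extract(r"(\d+(?:\.\d+)?)", expand=False)
metadata["age"] = pd.to_numeric(age_number, errors="coerce")

print(metadata)
```

Mapping unknown spellings to NaN rather than passing them through makes unexpected variants easy to spot afterwards.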

# Values Preprocessing 

In [42]:
import pandas as pd
import glob


def clean_name(name):
    # Remove spaces and replace ',' or '+' with '/' in miRNA names
    return name.replace(' ', '').replace(',', '/').replace('+', '/')


# Read expression value tables, cleaning the miRNA names in the first column
df_list = [pd.read_csv(f, converters={0: clean_name}) for f in glob.glob('data/raw/exp/*.csv')]

# Transpose and concatenate data into a single dataframe.
# Rows indexed by the same miRNA are merged, keeping the maximum value
values = pd.concat(
    df.set_index(df.columns[0]).groupby(df.columns[0]).aggregate('max').T
    for df in df_list
)

# Remove samples whose id is not included in metadata file
metadata = pd.read_csv('data/PD.csv', index_col='sample_id')
values = values[values.index.isin(metadata.index)]

# Remove empty columns
values.dropna(axis=1, how='all', inplace=True)

# Save expression value table
values.to_csv('data/EXP.csv')

# Save the miRNA list with the number of samples that have a value for each miRNA
pd.DataFrame(values.count(axis=0), columns=['samples']).to_csv('data/miRNA.csv')
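To make the merge and transpose step concrete, here is a small self-contained example on two made-up tables. The miRNA names, sample IDs, and values are invented, and `to_sample_rows` is a hypothetical helper name. It shows duplicate miRNA rows collapsing to their maximum, and miRNAs missing from a table ending up as NaN after concatenation.

```python
import pandas as pd

# Two hypothetical expression tables: the first column holds miRNA names,
# the remaining columns are samples. All values are made up.
df_a = pd.DataFrame({
    "ID": ["hsa-miR-21", "hsa-miR-21", "hsa-miR-155"],
    "S1": [1.0, 3.0, 5.0],
    "S2": [2.0, 0.5, 6.0],
})
df_b = pd.DataFrame({
    "ID": ["hsa-miR-21", "hsa-let-7a"],
    "S3": [4.0, 7.0],
})


def to_sample_rows(df):
    """Merge rows sharing a miRNA name (max) and transpose to samples x miRNAs."""
    name_column = df.columns[0]
    return df.set_index(name_column).groupby(name_column).aggregate("max").T


# Concatenate along rows; columns are aligned by miRNA name.
values = pd.concat(to_sample_rows(df) for df in (df_a, df_b))

# Number of samples with a value for each miRNA.
frequency = values.count(axis=0)

print(values)
print(frequency)
```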

# Issues

- Dataset integration not recommended.

- Previous experiments:

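    - The _Disease_ attribute was extracted from the _Dataset_ table, but datasets also contain healthy samples, so a dataset-level label mislabels them.

    - If expression values from different datasets are not comparable, the generated results are biased: the model can learn each dataset's value range as a proxy for the disease.
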
- Standardize the _disease_ column.

- Select relevant and frequent miRNA:

    - Suffixes: *, _vX.0, +a703

    - Strange identifier: miRplus
    
    - Multiple: .../.../..., ...+...+..., ...,...,...

- Define graph structure.
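One possible way to handle the miRNA naming issues listed above is sketched below. The suffix patterns are assumptions derived only from the three examples in the list (`*`, `_vX.0`, `+a703`), and `clean_names` and `is_mirplus` are hypothetical helpers. Whether a suffix should be dropped or kept as a distinct feature is left open here.

```python
import re

# Assumed suffix patterns, based only on the examples listed above.
SUFFIXES = re.compile(r"(?:\*|_v\d+\.\d+|\+a703)+$")

# Separators used in identifiers that combine several miRNAs.
SEPARATORS = re.compile(r"[/+,]")


def clean_names(raw):
    """Return the cleaned miRNA names contained in one raw identifier."""
    # Strip trailing suffixes first, so that '+a703' is not split as a separator.
    whole = SUFFIXES.sub("", raw.replace(" ", ""))
    parts = SEPARATORS.split(whole)
    # Strip suffixes again on each part and discard empty results.
    cleaned = [SUFFIXES.sub("", part) for part in parts]
    return [name for name in cleaned if name]


def is_mirplus(name):
    """Flag identifiers containing 'miRplus' (case-insensitive)."""
    return "mirplus" in name.lower()


examples = [
    "hsa-miR-21*",
    "hsa-miR-16_v2.0",
    "hsa-miR-99a+a703",
    "hsa-miR-1*, hsa-miR-2",
    "hsa-miR-3+hsa-miR-4+hsa-miR-5",
]
for raw in examples:
    print(raw, "->", clean_names(raw))
```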