# Tutorial: basic wenda_gpu usage

This notebook walks through the basic steps needed to run wenda_gpu on a small, simulated dataset.

In [1]:
from wenda_gpu import wenda_gpu as wg
import pandas as pd
import numpy as np
import os

import warnings
warnings.filterwarnings('ignore')

The variable "prefix" is a unique identifier for your dataset. It lets you run wenda_gpu on multiple datasets while keeping each dataset's files in its own subdirectory of a shared directory structure.

In [2]:
prefix = "simulated_example"
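
The paths used elsewhere in this notebook suggest how the prefix nests each dataset's files. The sketch below only collects those paths in one place. `expected_paths` is a hypothetical helper, not part of wenda_gpu, and the file names are taken from the cells in this notebook rather than from the library's documentation, so treat them as assumptions.

```python
import os

def expected_paths(prefix, data_dir="data"):
    """Collect the paths this notebook reads or lists for one prefix.

    Hypothetical helper: the names mirror the paths that appear in this
    notebook and are not guaranteed by wenda_gpu itself.
    """
    return {
        "source_data": os.path.join(data_dir, prefix, "source_data.tsv"),
        "target_data": os.path.join(data_dir, prefix, "target_data.tsv"),
        "source_labels": os.path.join(data_dir, prefix, "source_y.tsv"),
        "feature_models": os.path.join("feature_models", prefix),
        "confidences": os.path.join("confidences", prefix),
        "output": os.path.join("output", prefix),
    }

paths = expected_paths("simulated_example")
for name, path in paths.items():
    print(f"{name:>15}: {path}")
```

Using a second prefix with the same helper yields a parallel set of directories, which is what keeps several datasets from overwriting each other.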

The source and target datasets should each be a matrix where each column is a feature and each row is a sample. Thus, the source and target data files must have the same number of columns, but need not have the same number of rows.

In [3]:
source_x, target_x = wg.load_data(prefix=prefix, data_path="data")

print(source_x)
source_x.shape

[[ 0.6657  0.1712  0.6055 ... -0.3662 -0.079  -0.9831]
 [-0.5039 -0.4653 -0.6866 ...  0.649  -0.2828  0.5439]
 [-0.6244 -0.69    0.0616 ...  1.3983 -0.2632  1.7183]
 ...
 [-1.184  -2.1196 -1.191  ...  2.3368 -0.7823  0.5206]
 [-0.1836  0.6838  0.2268 ... -1.4623  0.2442 -0.2596]
 [-1.7111 -1.7859 -1.6365 ... -0.401   1.1332 -1.9584]]


(1200, 500)

In [4]:
print(target_x)
target_x.shape

[[-0.1909 -1.0174 -0.7121 ...  0.9328 -3.1927  0.2734]
 [ 0.9062  1.445   0.8685 ... -0.1718 -1.825  -0.7571]
 [-1.8884 -1.4984 -1.0887 ... -1.3003  1.2522 -1.1692]
 ...
 [-0.7921 -0.8746 -0.5454 ... -0.3572  2.646   1.5304]
 [-0.1447 -0.4359  0.0337 ...  1.3563 -0.3737 -0.9744]
 [ 1.1162  1.7021  1.4116 ... -0.4561  2.0629  0.1082]]


(1000, 500)
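
Since the source and target must share the same features, it can be worth checking the column counts before going further. The following is a minimal sketch using stand-in random arrays with the same shapes as the example data. `check_feature_alignment` is a hypothetical helper, not a wenda_gpu function.

```python
import numpy as np

def check_feature_alignment(source, target):
    """Return the shared feature count, or raise ValueError on a mismatch."""
    if source.ndim != 2 or target.ndim != 2:
        raise ValueError("source and target must both be 2-D (samples x features)")
    if source.shape[1] != target.shape[1]:
        raise ValueError(
            f"feature count mismatch: source has {source.shape[1]} columns, "
            f"target has {target.shape[1]}"
        )
    return source.shape[1]

# Stand-in arrays with the same shapes as the example data (values are random)
rng = np.random.default_rng(0)
demo_source = rng.normal(size=(1200, 500))
demo_target = rng.normal(size=(1000, 500))

n_features = check_feature_alignment(demo_source, demo_target)
print(n_features)
```

The row counts (1200 and 1000 here) are deliberately not compared, because the two datasets need not contain the same number of samples.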

If your data is already stored somewhere without the prefix directory structure, or uses different file naming conventions, you can load the files manually and then convert them to Fortran-ordered (column-major) NumPy arrays, like so:

In [5]:
source_file = "data/simulated_example/source_data.tsv"
source_table = pd.read_csv(source_file, sep="\t", header=None)
source_x = np.asfortranarray(source_table.values)

target_file = "data/simulated_example/target_data.tsv"
target_table = pd.read_csv(target_file, sep="\t", header=None)
target_x = np.asfortranarray(target_table.values)

print(source_x)
source_x.shape

[[ 0.6657  0.1712  0.6055 ... -0.3662 -0.079  -0.9831]
 [-0.5039 -0.4653 -0.6866 ...  0.649  -0.2828  0.5439]
 [-0.6244 -0.69    0.0616 ...  1.3983 -0.2632  1.7183]
 ...
 [-1.184  -2.1196 -1.191  ...  2.3368 -0.7823  0.5206]
 [-0.1836  0.6838  0.2268 ... -1.4623  0.2442 -0.2596]
 [-1.7111 -1.7859 -1.6365 ... -0.401   1.1332 -1.9584]]


(1200, 500)
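
To see what the conversion does, the short sketch below compares a default NumPy array with its Fortran-ordered copy on a tiny made-up matrix. Only the memory layout changes; the values and shape stay the same.

```python
import numpy as np

# NumPy's default layout is C order (row-major)
c_array = np.arange(6, dtype=float).reshape(2, 3)

# Same values and shape, stored in Fortran order (column-major)
f_array = np.asfortranarray(c_array)

print(c_array.flags["C_CONTIGUOUS"], c_array.flags["F_CONTIGUOUS"])
print(f_array.flags["C_CONTIGUOUS"], f_array.flags["F_CONTIGUOUS"])
print(np.array_equal(c_array, f_array))
```

Checking `source_x.flags["F_CONTIGUOUS"]` after a manual load is a quick way to confirm the array is in the layout this tutorial asks for.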

Now that the data is loaded, we will need to normalize both source and target datasets. Both datasets are normalized based on the source data's distribution to allow direct comparison between the two.

In [6]:
source_x_norm, target_x_norm = wg.normalize_data(source_x, target_x)

source_x_norm

array([[ 0.69694417,  0.21877732,  0.66571408, ..., -0.38984496,
        -0.0504219 , -1.0386572 ],
       [-0.45969857, -0.42033402, -0.64428229, ...,  0.6596134 ,
        -0.25281859,  0.52678521],
       [-0.57886363, -0.64595589,  0.11428073, ...,  1.43419886,
        -0.23335355,  1.73075087],
       ...,
       [-1.13226418, -2.08142104, -1.15566854, ...,  2.40436894,
        -0.74887918,  0.50289863],
       [-0.14294697,  0.73348034,  0.28176885, ..., -1.52293332,
         0.27055263, -0.296943  ],
       [-1.65352477, -1.74635198, -1.60733898, ..., -0.4258193 ,
         1.15343123, -2.03851049]])
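
One common way to normalize two datasets against the source distribution is to z-score every feature using the source mean and standard deviation. The sketch below illustrates that idea on made-up data. It is an assumption about the general approach, not a copy of what `wg.normalize_data` does internally, and details such as the exact standard deviation estimator may differ. `standardize_with_source_stats` is a hypothetical name.

```python
import numpy as np

def standardize_with_source_stats(source, target):
    """Z-score both arrays per feature using statistics of the source only."""
    mean = source.mean(axis=0)
    std = source.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant features unscaled
    return (source - mean) / std, (target - mean) / std

# Made-up data: the target is shifted relative to the source
rng = np.random.default_rng(0)
demo_source = rng.normal(loc=2.0, scale=3.0, size=(200, 4))
demo_target = rng.normal(loc=5.0, scale=3.0, size=(150, 4))

source_norm, target_norm = standardize_with_source_stats(demo_source, demo_target)
print(source_norm.mean(axis=0).round(6))  # approximately zero by construction
print(target_norm.mean(axis=0).round(2))  # not zero: the shift is preserved
```

Because the target is scaled with the source statistics, any shift between the two distributions stays visible after normalization instead of being removed.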

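With the data loaded and normalized, we are ready to train the feature models. Note that this may take several minutes on the example dataset and up to several hours on larger datasets. Training writes one model file and one confidence file per feature, which the two cells after the training step list.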

In [7]:
wg.train_feature_models(source_x_norm, target_x_norm, prefix=prefix)

Training models 0 to 99...
Training models 100 to 199...
Training models 200 to 299...
Training models 300 to 399...
Training models 400 to 499...


In [8]:
feature_model_files = os.listdir(os.path.join("feature_models",prefix))
print(feature_model_files[1:5])
len(feature_model_files)

['model_466.pth', 'model_95.pth', 'model_99.pth', 'model_176.pth']


500

In [9]:
confidence_files = os.listdir(os.path.join("confidences",prefix))
print(confidence_files[1:5])
len(confidence_files)

['model_92_confidence.txt', 'model_378_confidence.txt', 'model_488_confidence.txt', 'model_240_confidence.txt']


500
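
`os.listdir` returns entries in arbitrary order, which is why the file names above look shuffled. As a sketch, the hypothetical helpers below sort the names by their numeric index and report any index that is missing, assuming the `model_<index>` naming seen in the listings above and indices that start at 0.

```python
import re

def model_index(filename):
    """Extract the integer index from names like 'model_95.pth'."""
    match = re.search(r"model_(\d+)", filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return int(match.group(1))

def missing_indices(filenames, n_features):
    """List feature indices in range(n_features) that have no file."""
    present = {model_index(name) for name in filenames}
    return sorted(set(range(n_features)) - present)

listed = ["model_466.pth", "model_95.pth", "model_99.pth", "model_176.pth"]
ordered = sorted(listed, key=model_index)
print(ordered)

# Tiny made-up example: index 1 has no file
print(missing_indices(["model_0.pth", "model_2_confidence.txt"], 3))
```

On the real directories, `missing_indices(feature_model_files, 500)` returning an empty list would confirm that every feature has a trained model.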

In [10]:
source_y = wg.load_labels(prefix)

source_y

array([[1],
       [0],
       [1],
       ...,
       [0],
       [1],
       [0]])

Again, if your data is stored separately, you can load the labels manually and then convert them to a Fortran-ordered NumPy array, like so:

In [11]:
label_file = "data/simulated_example/source_y.tsv"
label_table = pd.read_csv(label_file, sep="\t", header=None)
source_y = np.asfortranarray(label_table.values)
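
After loading labels by hand, two quick checks can catch common mistakes: the labels should have one row per source sample, and it helps to know whether they are binary before choosing the model type. The sketch below uses a small made-up label array. `describe_labels` is a hypothetical helper, not part of wenda_gpu.

```python
import numpy as np

def describe_labels(labels, n_source_samples):
    """Return (unique values, is_binary) after checking the row count."""
    labels = np.asarray(labels)
    if labels.shape[0] != n_source_samples:
        raise ValueError(
            f"expected {n_source_samples} labels, found {labels.shape[0]}"
        )
    values = np.unique(labels)
    # Binary here means exactly the two classes 0 and 1
    is_binary = set(values.tolist()) == {0, 1}
    return values, is_binary

# Made-up column vector of labels, shaped like source_y above
demo_labels = np.array([[1], [0], [1], [0]])
values, is_binary = describe_labels(demo_labels, n_source_samples=4)
print(values, is_binary)

# Continuous labels are reported as non-binary
_, continuous_is_binary = describe_labels(np.array([[0.3], [1.7], [2.2]]), 3)
print(continuous_is_binary)
```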

We now have all the components needed to run the weighted elastic net. Because our labels are binary, we set logistic=True to fit a logistic (classification) elastic net instead of the elastic net regression used for continuous labels. This may take a minute on the example data and several minutes on larger datasets.

In [12]:
wg.train_elastic_net(source_x_norm, source_y, target_x_norm, prefix=prefix, logistic=True, verbose=True)

k_wnet = 0
k_wnet = 1
k_wnet = 2
k_wnet = 3
k_wnet = 4
k_wnet = 6
k_wnet = 8
k_wnet = 10
k_wnet = 14
k_wnet = 18
k_wnet = 25
k_wnet = 35


The elastic net model's predictions for the target data are written automatically to the output folder.

In [13]:
results_file = os.path.join("output", prefix, "k_01", "target_predictions.txt")
results = pd.read_csv(results_file, header=None)

print(results)

     0
0    1
1    1
2    0
3    0
4    1
..  ..
995  1
996  1
997  0
998  0
999  1

[1000 rows x 1 columns]


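Because we ran logistic regression, we can also see the assignment probabilities for each sample. In each row, the first value is the probability of label 0 and the second is the probability of label 1. The two values are separated by a space, so read_csv with its default comma separator loads each row as a single column.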

In [14]:
probability_file = os.path.join("output", prefix, "k_01", "target_probabilities.txt")
probabilities = pd.read_csv(probability_file, header=None)

print(probabilities)

                           0
0    1.45078e-01 8.54922e-01
1    3.50972e-02 9.64903e-01
2    9.99732e-01 2.68061e-04
3    9.73876e-01 2.61243e-02
4    4.33515e-01 5.66485e-01
..                       ...
995  2.86878e-01 7.13122e-01
996  1.27318e-01 8.72682e-01
997  9.99943e-01 5.74289e-05
998  5.46992e-01 4.53008e-01
999  6.63956e-04 9.99336e-01

[1000 rows x 1 columns]
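
To work with the probabilities numerically, the file can be parsed with a whitespace separator so that each probability gets its own column. The sketch below uses the first five rows printed above as in-memory sample text. It assumes the file is whitespace-separated, as the printed output suggests, and the column names `p_label_0` and `p_label_1` are made up for illustration.

```python
import io
import pandas as pd

# First five rows from the output above, used as stand-in file contents
sample_text = """1.45078e-01 8.54922e-01
3.50972e-02 9.64903e-01
9.99732e-01 2.68061e-04
9.73876e-01 2.61243e-02
4.33515e-01 5.66485e-01
"""

probabilities = pd.read_csv(
    io.StringIO(sample_text),
    sep=r"\s+",                        # split on whitespace, not commas
    header=None,
    names=["p_label_0", "p_label_1"],  # illustrative column names
)

# The more probable label per sample
predicted = probabilities.to_numpy().argmax(axis=1)
print(probabilities.shape)
print(predicted.tolist())
```

For these five rows, the more probable label matches the first five predictions shown earlier (1, 1, 0, 0, 1). To apply the same parsing to the real file, pass `probability_file` in place of the `io.StringIO` object.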
